Boerhaave 2025: Using R for data analysis

Authors
Affiliation

Ramin Monajemi

Leiden University Medical Center, Biomedical Data Sciences

Szymon M. Kiełbasa

Leiden University Medical Center, Biomedical Data Sciences

Published

January 20, 2025

Timetable

R and RStudio basics (slot 1)

Time: Thursday, 2025-01-16, 09:00-12:30
Location: LUMC, lecture room CZ-7

Data structures 1/2 (slot 2)

Time: Thursday, 2025-01-16, 13:30-17:00
Location: LUMC, lecture room CZ-7

Data manipulation 1/3 (slot 3)

Time: Friday, 2025-01-17, 09:00-12:30
Location: LUMC, lecture room CZ-7

Data manipulation 2/3 (slot 4)

Time: Friday, 2025-01-17, 13:30-17:00
Location: LUMC, lecture room CZ-7

Extras/Hints: Sampling age distribution and age categories, Grades report, set operations.

Data structures 2/2, Flow control (slot 5)

Time: Monday, 2025-01-20, 09:00-12:30
Location: LUMC, lecture room CZ-7

Graphics (slot 6)

Time: Monday, 2025-01-20, 13:30-17:00
Location: LUMC, lecture room CZ-7

Data manipulation 3/3 (slot 7)

Time: Tuesday, 2025-01-21, 09:00-12:30
Location: LUMC, lecture room CZ-7

Self-Study Assignment (slot 8)

Time: Tuesday, 2025-01-21, 13:30-17:00
Location: LUMC, lecture room CZ-7

R and RStudio basics

Console as a calculator (lecture) 👩‍🏫

Gives: “arithmetic”, “callingFunctions”, “help”, “console”, “prompt”, “comment”, sqrt, log10, exp, sin, sum.

Installation

Go to RStudio download page.
Follow RStudio Desktop (and R) installation steps and once installed, start the RStudio program.

Global options

Important preparation of settings in RStudio:

  • Go to the “Tools” menu and select “Global Options”. A window with “Options” will appear.
  • Select “General” section, “Basic” panel and look at the “Workspace” part.
  • Disable option “Restore .RData into workspace at startup”.
  • At “Save workspace to .RData on exit” select “Never”.
  • Finally, press “Apply”.

Console, expressions

Your screen will be divided into several parts (panes).
In the console pane you will see the > sign, called the prompt.
After the prompt you can type expressions in the R language.
For example, type there a mathematical expression: 2+2 and press Enter.
You should see the result: 4.

R performs calculations according to the rules of arithmetic operations.
The order of operations is:

  • parentheses (...)
  • exponentiation ^
  • multiplication * and division /
  • addition + and subtraction -

Use . (dot) to separate the integer part from the decimal part of a number: for example 3.5.
You may also use the exponent notation: 35e-1 is the same as 3.5 (e-1 denotes 10 to the power of -1).

Note that the # sign is used to start a comment in R.
Everything after the # sign till the end of the line is ignored by R.
Spaces are also (mostly) ignored by R. You can use them to make your code more readable

Errros

Observe how R reports errors. Type the following lines:

2+3)      # missing opening parenthesis
2 + * 3   # missing number between the addition and the multiplication operators
0,5       # use dot instead of comma to denote one half

Multiline expressions

When Enter is pressed, but an expression is unfinished, R will show a + sign instead of the > prompt. It means that R is waiting for the rest of the expression. Either type the rest of the expression or press Esc (or Ctrl+C) to cancel the expression. Try the following:

3 +
4

(((((
    1
)))))

Calling functions

More complex operations are provided by functions. For example, the sqrt function calculates the square root of a number. Type sqrt(9) and press Enter.
Call the following functions with the arguments given below:

sqrt(9)     # square root, here: sqrt(9)=sqrt(3^2)=3
log10(100)  # logarithm base 10, here: log10(100)=log10(10^2)=2
exp(1)      # exponential function, here: e^1 = e
sin(pi/2)
sum(1, 2, 3, 4, 5)
sum()

Getting help

In the RStudio window find another pane with the Help tab. Type sqrt in the search box and press Enter. You will see the help page for the sqrt function. The help page contains the description of the function, its arguments, and examples of usage. To get the help, you may also type ?sqrt in the console and press Enter.

The internet usually provides more up-to-date information. Try to search for R sqrt in your favorite search engine.

Numbers and arithmetic (exercise) 📖

Needs: “arithmetic”, “callingFunctions”, “help”, “console”, “comment”, sqrt, log10, exp, sin, sum.
Gives: round, pi.
Practices: “arithmetic”, “help”, “callingFunctions”, “console”, “comment”, sin, round, pi.

Arithmetic operations and their order

In the console, one-by-one type the expressions given below. Each time before you press Enter, think what should be the result. Are the results as expected?

3*4+5
3+4*5
(3+4)*5
3-2
3--2
3-(-2)
-2--2
8/2
8/-2
8/2*2
8/(2*2)
2^5

Real numbers, zero

Check the following expressions (think about the results before you press Enter):

1/2
2.0^-1
1/2 # +1/2
1 / 2
1e3
0.54321e5
1e-1
-1e-3
-1e-6
-1e-6*1000
sin(0)
sin(2*pi)
1-(1/3+1/3+1/3)  # not so interesting
1-1/3-1/3-1/3    # interesting, isn't it?

Rounding numbers

Have a look at the following expressions (and check the manual for the round function):

round( 1/3, 2 )
round( 1/3, 5 )
round( 1/3, 0 )

round( 0.1, 0 )
round( 0.9, 0 )
round( 1.1, 0 )
round( 1.9, 0 )

round( 0.5, 0 )
round( 1.5, 0 ) # surprised?
round( -0.5, 0 )
round( -1.5, 0 )

RStudio projects (lecture) 👨‍🏫

Gives: “RStudioProjects”.

A typical data analysis involves multiple files:

  • Input files with data, fixed before analysis
    (e.g. files in various formats containing tables of data to be analysed).
  • Script files or report files
    (there you develop programs in the R language which analyse the input data and report results).

RStudio uses the concept of a project to organize files of one analysis and separate them from other analyses.
On a disk drive, a project is a directory containing the files of the analysis.
When you open a project, RStudio remembers the files you were working on and the settings you used.

Creating a new project

Follow the subsequent steps to create a new (example) project Learning_R.

From the RStudio menu choose: menu File, New project... and then New directory... and New project....
In the final dialog box:

  • As the new directory name type: Learning_R.
  • In the next option you may select where to create this directory.
  • Press Create project to finalize the process.

Check whether you use the new project

Refer to the following Figure:

  • Look at the top right corner of the RStudio window to see the name of the current project. Learning_R should be the current project.
  • The title line shows the project diectory. This is the current directory and there the files of the project are/will be kept.
  • Note the project file Learning_R.Rproj in the Files panel. Later, when you find this file in a file explorer, once clicked it should start RStudio and open this project.

Copy input data files to the project directory

Although R can be used to analyse data from files located anywhere in the filesystem, it is convenient to have such files in the project directory.
For the following parts of the course you need the file(s): pulseNA.csv.
Download (with a browser) or copy (with the system file explorer) these files to the project directory. Note, that the details of downloading/copying depend on the operating system of your computer.

Once the files are present in the project directory, they should be visible in the Files panel (like the example files pulse.csv and survey.csv below).

R scripts and R Markdown documents (lecture) 👩‍🏫

Needs: “console”, “RStudioProjects”.
Gives: “RScript”, “RMarkdown”.

What is an R script

R script is a simple text file (with extension .R, for example my_script.R) containing expressions written in the R language. When an R script is sourced, the lines of the script are one-by-one copied to the Console and immediately executed. An R script allows to keep R commands saved in a file (without saving, the commands given in the Console are executed and forgotten).

What is an R Markdown document

An R Markdown document (with extension .Rmd, for example my_report.Rmd) is an extension of an R script. It is a format for developing elegant and reproducible reports. It is a simple text file consisting of:

  • Chunks written in R language, where data analysis is performed.
  • Free text written in Markdown notation, possibly containing: sections, lists, tables, hyperlinks, formulae: \(r=\sqrt{x^2+y^2}\), etc.

The Knit operation converts an R Markdown document into a report file. Many report output formats are available: HTML, PDF (printable pages or presentations), Microsoft Word documents, etc.
(For example: this page is written in R Markdown and knitted to HTML)

Follow these links to find more:

Create a new R Markdown document

Note: Before creating a new document, first check whether you work in the Learning_R project.

From the RStudio menu choose File/New File/R Markdown...
In the dialog box you may specify some details of the analysis (later change is possible):

  • Title of the analysis.
    (Note: this is the title of your future report, and not the name of the file where the document gets saved.)
  • Name of the author.
  • Output format (for now, select HTML).

Once accepted, a new document Untitled1 is shown.
The new document is not empty; it contains a demo example written in R Markdown.

Knit an R Markdown document

Knitting is the name of the process of conversion from R Markdown text file to a report in one of the target formats (for example: HTML).

  • Press the Knit button to start the conversion.
  • If the R Markdown document does not have a file name yet, a save dialog box will pop up to ask for the file name.
  • Some R libraries might get automatically installed before the first Knit.
  • The knitting process might take a few seconds and then the final document should appear either in a new window or in the Viewer pane.
    The Figure below shows the result of knitting of the standard example (R Markdown was in file named Lesson_1.Rmd and was knitted to HTML file Lesson_1.html):

Parts of an R Markdown document

An R Markdown document consists of: header, chunks and free text.

The header:

  • Is located at the first lines of the file.
  • Starts and ends with a line containing --- (three minus symbols).
  • May specify title, author, date and possibly other properties of the document.

A chunk:

  • Is a place where code in R language is written. The result of this code will be displayed below the chunk.
  • Starts with a line containing ```{r} (three backticks, then r in curly braces).
  • Ends with a line containing ```
  • Has a gray background (in RStudio).

A free text:

  • Not the header and not a chunk.
  • Has a white background (in RStudio).

Try to locate these elements in the picture shown below:

Editing an R Markdown document

Useful keystrokes:

  • To insert a new chunk you may use menu Code/Insert Chunk or the keyboard shortcut shown in the menu.
  • To run the chunk containing the cursor you may use menu Code/Run Region/Run Current Chunk or the keyboard shortcut shown in the menu.
  • To run the line containing the cursor or currently selected text you may use Ctrl-Enter.

Reading a table from a .csv file (lecture) 👨‍🏫

Files: pulseNA.csv, pulseNA.csv.gz.
Needs: “RMarkdown”.
Gives: read.csv, “pulseNA.csv”, head, $, str.

The experiment

Students in an introductory statistics class (MS212 taught by Professor John Eccleston and Dr Richard Wilson at The University of Queensland, years 1993-1998) participated in a simple experiment.
The students measured their own pulse rate. They were then asked to flip a coin. If the coin came up heads, they were to run in place for one minute. Otherwise they sat without movement for one minute. Then everyone measured their pulse again.

The pulse rates and other physiological and lifestyle data are given in the data table. A slightly modified version of the table is available in the pulseNA.csv file.

Preparation

Make sure that the file pulseNA.csv is available in the working directory (check pulseNA.csv is shown the Files pane).
Work in a newly created R Markdown document, saved in the same workind directory (check that your current .Rmd file is also shown in the Files pane).
Type R expressions shown below in R code chunks (on gray background) in the R Markdown document.
Use Ctrl-Enter to execute the expressions in the console.

Reading the table

The read.csv function (from the base-R package) reads a table from a file in the comma-separated values (.csv) format. The function returns the table (as a data.frame object).
Note: read.csv and read_csv are two different functions providing similar functionality, and they will be discussed later in more detail.

In the example below, the table gets assigned to the variable d. Then, the 3 top rows of the table from d are extracted with the head function and printed.

d <- read.csv( "pulseNA.csv" )
head( d, 3 )
      id     name height weight age gender smokes alcohol exercise   ran pulse1
1 1993_A   Bonnie    173     57  18 female     no     yes moderate FALSE     86
2 1993_B  Melanie    179     58  19 female     no     yes moderate  TRUE     82
3 1993_C Consuelo    167     62  18 female     no     yes     high  TRUE     96
  pulse2 year
1     88 1993
2    150 1993
3    176 1993

The table should have more than 10 columns (id, name, …). If you don’t see separate columns, the file with the table might be broken. Try then to load the file named pulseNA.csv.gz (a compressed version of the pulseNA.csv file; R will unpack it automatically). If you still have problems, consult a teacher.
If the table is too wide, then one row of the table might be shown in multiple lines.

Table properties

A quick overview of the table can be obtained with the str function.
Number of rows and columns is shown, as well as names and types of data in the columns.

str(d)
'data.frame':   110 obs. of  13 variables:
 $ id      : chr  "1993_A" "1993_B" "1993_C" "1993_D" ...
 $ name    : chr  "Bonnie" "Melanie" "Consuelo" "Travis" ...
 $ height  : int  173 179 167 195 173 184 162 169 164 168 ...
 $ weight  : num  57 58 62 84 64 74 57 55 56 60 ...
 $ age     : int  18 19 18 18 18 22 20 18 19 23 ...
 $ gender  : chr  "female" "female" "female" "male" ...
 $ smokes  : chr  "no" "no" "no" "no" ...
 $ alcohol : chr  "yes" "yes" "yes" "yes" ...
 $ exercise: chr  "moderate" "moderate" "high" "high" ...
 $ ran     : logi  FALSE TRUE TRUE FALSE FALSE TRUE ...
 $ pulse1  : int  86 82 96 71 90 78 68 71 68 88 ...
 $ pulse2  : int  88 150 176 73 88 141 72 77 68 150 ...
 $ year    : int  1993 1993 1993 1993 1993 1993 1993 1993 1993 1993 ...

Accessing table columns

With the $ operator, you can access a single column of the table.

d$pulse1                # column `pulse1` as a vector of numbers
  [1]  86  82  96  71  90  78  68  71  68  88  76  74  70  78  69  77  64  80
 [19]  83  78  88  70  78  80  68  70  62  81  78  86  59  68  75  74  60  70
 [37]  80  58  84 104  66  84  65  80  66 104  76  70  66  92  70  63  65  76
 [55]  56  64  60  68  80  65  47  50  80  76  70  76  72  80  76  85  49  76
 [73] 145  83  72  NA  60  80  70  68  78  52  74  75  72  80  84  74  90  61
 [91]  85  78  76  90  64  64  88  64  82  88  74  88  92  76  71 119  90  86
[109]  69  75

Observe, that you may also use the head function to get first elements of that vector:

head( d$pulse1, 10 )    # first 10 elements of the vector from `pulse1` column
 [1] 86 82 96 71 90 78 68 71 68 88

Reading a table from a .csv file, new way (lecture) 👨‍🏫

Files: pulseNA.csv.
Needs: read.csv, “pulseNA.csv”, head.
Gives: “tidyverse”, read_csv, install.packages, library.

The tidyverse package

The read.csv function belongs to the base-R package. It provides functionality as it was described in early years of R development. Functions developed recently are usually more user-friendly and less error-prone. The tidyverse package is a collection of such functions, and this course favors tidyverse functions over base-R functions.

More information about the tidyverse package at tidyverse.org.

Package installation

To install the tidyverse package type once in the console:

install.packages("tidyverse")

Some messages will be shown, and possibly some additional packages will be installed.

Reading a table

The following code reads the pulseNA.csv file into the d variable using the read_csv function from the tidyverse package. The function returns the table as an object of type tibble, which is a modern version of the data.frame with many improvements.

Before you can use the read_csv function, you need to load the tidyverse package with the library function. You do it once, usually in a separate chunk at the top of the script.

library(tidyverse)              # load the tidyverse package, top chunk usually
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
d <- read_csv( "pulseNA.csv" )  # NOTE: read_csv not read.csv !!!
Rows: 110 Columns: 13
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (6): id, name, gender, smokes, alcohol, exercise
dbl (6): height, weight, age, pulse1, pulse2, year
lgl (1): ran

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head( d, 3 )
# A tibble: 3 × 13
  id     name    height weight   age gender smokes alcohol exercise ran   pulse1
  <chr>  <chr>    <dbl>  <dbl> <dbl> <chr>  <chr>  <chr>   <chr>    <lgl>  <dbl>
1 1993_A Bonnie     173     57    18 female no     yes     moderate FALSE     86
2 1993_B Melanie    179     58    19 female no     yes     moderate TRUE      82
3 1993_C Consue…    167     62    18 female no     yes     high     TRUE      96
# ℹ 2 more variables: pulse2 <dbl>, year <dbl>

Reading a table (silent)

You can ignore the messages (including the conflict ones). For now, you may prefer to use the following lines, which do not show the messages:

suppressPackageStartupMessages( library(tidyverse) )  # once in the top chunk
d <- read_csv( "pulseNA.csv", show_col_types=FALSE )
head( d, 3 )
# A tibble: 3 × 13
  id     name    height weight   age gender smokes alcohol exercise ran   pulse1
  <chr>  <chr>    <dbl>  <dbl> <dbl> <chr>  <chr>  <chr>   <chr>    <lgl>  <dbl>
1 1993_A Bonnie     173     57    18 female no     yes     moderate FALSE     86
2 1993_B Melanie    179     58    19 female no     yes     moderate TRUE      82
3 1993_C Consue…    167     62    18 female no     yes     high     TRUE      96
# ℹ 2 more variables: pulse2 <dbl>, year <dbl>

Table reading practice in a new R session (exercise) 📖

Files: pulseNA.csv.
Needs: read.csv, read_csv, “tidyverse”, head, str.
Gives: print, as_tibble, as.data.frame, tail.
Practices: head, tail, str, print, as_tibble, as.data.frame.

Practice this exercise in a new, separate RStudio session. Create a new R project, download the pulseNA.csv file, and save it in the project directory. Create a new R Markdown script, save it in the project directory. Finally, develop the content of the script as described below.

Read pulseNA.csv twice:

  • into the bd variable as a data.frame using the read.csv function from base-R,
  • into the td variable as a tibble using the read_csv function from the tidyverse package.
Solution…
suppressPackageStartupMessages( library(tidyverse) )
bd <- read.csv( "pulseNA.csv" )
td <- read_csv( "pulseNA.csv" )

Both loaded objects represent the same table but they are not identical. Try the following code and observe small differences in output of str, head, and tail functions applied to both tables.

A tibble table by default prints only its top 10 rows - observe, how to use print to see the full table content. Also try to use the as_tibble function to convert the bd data.frame to a tibble and the as.data.frame function to convert the td tibble to a data.frame.

str(bd)
str(td)
head(bd)            # top 6 rows
head(td)            # top 6 rows
tail(bd)            # bottom 6 rows
tail(td)            # bottom 6 rows
bd$hei              # ERROR-PRONE in data.frame, a column with name starting with "hei"
td$hei              # ERROR in tibble, column name must much exactly
print(td)
print(td, n=1000)   # print 1000 rows

as_tibble(bd)       # convert data.frame to tibble
as.data.frame(td)   # convert tibble to data.frame

Investigating a table (exercise) 📖

Files: pulseNA.csv.
Needs: read.csv, “pulseNA.csv”, head, $, str.
Gives: colnames, names, dim, nrow, ncol, head, tail, “[rows,]”, min, max, range, fivenum, unique, table.
Practices: str, head, tail, colnames, names, dim, nrow, ncol, “[rows,]”, min, max, range, fivenum, unique, table.

(Note: This exercise illustrates the first steps of analysis of any new dataset.)

Find the pulseNA.csv file and read it into the d data frame (for this file it makes no difference whether you use the base-R read.csv function or the read_csv function from the tidyverse package).

Solution…
d <- read.csv( "pulseNA.csv" )

Investigating table properties

Investigate the structure of the d data frame with str. A lot of information is shown at once.
Let’s try to access some of the information step by step:

  • Learn how to use colnames and names functions to get the names of the table columns.
  • Learn how to use dim, nrow, ncol functions to get the numbers of rows and columns in the table.
Solution…
str(d)

colnames(d)
names(d)
dim(d)
nrow(d)
ncol(d)

Investigating rows

  • Learn how to use the head/tail functions to get the first/last 15 rows of the d data frame.
  • Get the same first/last rows using the single square bracket operator (hint: d[10:15,]). Note, there is a comma after the row index range.
Solution…
head(d, 15)
tail(d, 15)
d[1:15,]
d[(nrow(d)-15+1):nrow(d),]

Investigating columns

Understand the meaning of the columns. Decide, what type of data each columns represents (numerical, categorical, ordinal).

  • For numerical columns: learn how to find the range of values with functions min, max.
    Some columns contain missing data, denoted with NA.
    Learn how to omit missing values when calculating the minimum and maximum.
    Hint: min(d$pulse, na.rm=TRUE).
  • For numerical columns: you do not need to find min and max separately.
    Learn how to use range and fivenum functions to get more information about the range of values in a single function call.
  • For categorical/ordinal columns: learn how to use the unique function to get the unique values of the column.
  • For categorical/ordinal columns: learn how to count the occurrences of each level with the table function.
    Hint: table(d$gender, useNA="always").
  • For ordinal columns: what should be the order of the levels (it will need to be defined manually, later).
Solution…
min(d$height)                  # same with d$weight
max(d$height)
range(d$height)
fivenum(d$height)

min(d$age)                     # there are missing values in `age` column
min(d$age, na.rm=TRUE)
range(d$age, na.rm=TRUE)

range(d$pulse1)                # no missing values
fivenum(d$pulse1)

range(d$pulse2, na.rm=TRUE)    # some missing values
fivenum(d$pulse2)              # fivenum by default omits missing values

unique(d$gender)
table(d$gender, useNA="always")

unique(d$smokes)
table(d$smokes, useNA="always")

table(d$alcohol, useNA="always")
table(d$exercise, useNA="always")
table(d$ran, useNA="always")

range(d$year)                  # it looks like numbers
table(d$year)                  # but should be treated as categorical/ordinal

# ordinal columns:
# - exercise:   low moderate high
# - year:       1993 1995 1996 1997 1998

Simple histograms, scatterplots, barplots, boxplots (exercise) 📖

Files: pulseNA.csv.
Needs: read.csv, “pulseNA.csv”, $, str, min, max, table, fivenum.
Gives: hist, plot, barplot, boxplot, pairs, mean, sd.
Practices: hist, plot, barplot, boxplot, pairs, min, max, mean, sd, fivenum, table.

(Note: This exercise illustrates the first steps of analysis of any new dataset.)

Find the pulseNA.csv file and read it into the d data frame (for this file it makes no difference whether you use the base-R read.csv function or the read_csv function from the tidyverse package).

Solution…
d <- read.csv( "pulseNA.csv" )

Histogram

Pick a numerical column from the table. Find out how to use the hist function to produce a histogram of the column.
Calculate the minimum and maximum values of the column. In a similar way, calculate the mean and the standard deviation (sd). Relate the results to the histogram.

Solution…
hist(d$pulse1)
min(d$pulse1, na.rm=TRUE)
max(d$pulse1, na.rm=TRUE)
mean(d$pulse1, na.rm=TRUE)
sd(d$pulse1, na.rm=TRUE)

Scatterplot(s)

Find out how to use plot to produce a scatterplot of the dependence of height on weight (horizontal). Then, try the following code and discover its meaning:

partOfD <- d[, c("weight", "height", "pulse1", "pulse2") ]
pairs( partOfD )
Solution…
plot(x=d$weight, y=d$height)

# Selects only a few columns and produces a scatterplot matrix
partOfD <- d[, c("weight", "height", "pulse1", "pulse2") ]
pairs( partOfD )

Barplot

The table function can be used to count the occurrences of each level in a categorical column. Count the occurrences of each level in the exercise column. Now, find out how to put the result of the calcluation into the barplot function to produce a barplot of the counts of each exercise value. Observe, how the barplot plot corresponds to the results obtained from table.

Repeat the process for the year column.

Solution…
table(d$exercise)
barplot( table(d$exercise) )

cnts <- table(d$exercise)
barplot( cnts )

table(d$year)
barplot( table(d$year) )

Boxplot

Study the column weight. Use the fivenum function to get the five-number summary of the column. Then, use the boxplot function to produce a boxplot of the weight column. Observe, how the boxplot corresponds to the five-number summary.

Finally, understand the plot generated by the following code:

boxplot( weight ~ gender, data=d )
Solution…
fivenum( d$weight )
boxplot( d$weight )
boxplot( weight ~ gender, data=d )

Data structures 1/2

Variables and assignment (lecture) 👩‍🏫

Needs: “arithmetic”, “console”, “comment”.
Gives: “<-”, “=”, “assignment”.

Variable and its value

A variable stores a value, which can be explicitly provided or a result of a calculation.

Type the following lines in the Console. Observe changes in the Environment pane after each Enter.

gardenWidth <- 5                         # Assign 5 to the variable gardenWidth
gardenLength <- 6                        # Assign 6 to the variable gardenLength

gardenArea <- gardenWidth * gardenLength # Calculate the area of the garden
#                                        # Assign the result to the variable gardenArea

To find out what the value of a variable is, type its name:

gardenArea
[1] 30

A frequent error is to ask for a non-existing variable. Type to see the error:

gggardenArea
Error: object 'gggardenArea' not found

Assignment operator

The arrow symbol <- is called the assignment operator:

x <- 5                    # Assign 5 to x

You may also use the equal sign = for assignment:

y = 6                     # Assign 6 to y

No assignment, no change

Variable assignment is needed to store a new value or to replace a previous one.
When there is no assignment, the result is printed to screen and lost.

For example, the sequence of statements below is the wrong way to “increase x by one”.
The result is calculated but it is not stored.
Type these commands line by line. Observe when value of x changes in the Environment pane:

x <- 5                    # Assign 5 to x
x                         # Prints x (the current value is 5)
[1] 5
x + 1                     # Calculate x+1, prints 6
[1] 6
x                         # Prints x (the value is still 5; no assignment above)
[1] 5

Here is the correct way to “increase x by one”.
Type these commands line by line. Observe when value of x changes in the Environment pane:

x <- 5                    # Assign 5 to x
x                         # Prints x (the current value is 5)
[1] 5
x <- x + 1                # Calculate x+1 and assign the result to x
x                         # Prints x (the value is 6)
[1] 6

Remember: no assignment, no change.

Where variables are stored

The variables are stored in R memory and RStudio shows them in the Environment pane.
When you close RStudio, all variables are lost.

We will learn how to write scripts to store commands used to calculate variable values. The scripts provide reproducibility (they document how the results were calculated).

Note: If you notice that some variables are present in the Environment pane just after you have started RStudio, then it means that you load state of the previous session. Follow the instructions in the Calculator lecture to disable this behavior in Global Options.

Valid variable names

Variable names in R are case sensitive (all other symbols also).
Note that _ is allowed in variable names.
Numbers (not at the beginning) are allowed too.

Here are some examples (and each one is a different variable):

gardenLength <- 6
garden_length <- 6
gArDeNlEnGtH <- 6     # valid, but bad, don't use this
garden.length <- 6    # valid, but may lead to mistakes, don't use this

len1 <- 5             # len1 is good, 1len is wrong
len2 <- 6

Choose meaningful variable names for readable scripts.

Naturally, there are some limitations. Do not use variable names:

  • containing a space (possible, but not recommended)
  • containing any of @#!%$^()-+=!~?,<>}{][
  • for, while, repeat, if, else, TRUE, FALSE, Inf, NA, NaN, … (R reserved names)

Autocomplete names

RStudio provides autocomplete feature, useful for long variable names.

Start typing the first letters of a (variable) name and press Tab.

Vectors (lecture) 👨‍🏫

Needs: “console”, “arithmetic”, round, pi.
Gives: “Vectors”, c, length, sum, mean, median, “:”, seq, TRUE, FALSE, NA, “RelOps”, is.na, class, str, “[…]”, names.

A vector is a container of (multiple) elements:

  • All elements must be of the same type (e.g. all are numbers).
  • The elements are kept at numbered positions (there is the first element, second, …).
  • The elements might be given names (first may be named “Jay”, second “Gloria”, …).

Some examples of a vector (and its type):

  • Heights of students (a vector of numbers, numerical).
  • Names of students (a vector of texts, character).
  • Whether students have siblings (a vector of FALSE/TRUE values, logical).
  • Eye color of students (a vector of values from a limited choice list, factor).

Vector is the primary data structure of the R language.
Note: A table (data.frame or tible discussed later) can be seen as a collection of vectors (columns).

Numerical vector

A vector of numbers you can create with the combine function c. Arithmetic operations and many functions work on vectors elementwise.

radiuses <- c(3.5, 6, 7, 1+1)
radiuses                    # print the vector
[1] 3.5 6.0 7.0 2.0
radiuses + 2                # add 2 to each element
[1] 5.5 8.0 9.0 4.0

Vector elements can have names, and the names are carried over in operations:

radiuses <- c(typical=3.5, large=6, huge=7, tiny=1+1)
radiuses
typical   large    huge    tiny 
    3.5     6.0     7.0     2.0 
areas <- pi*radiuses^2
areas
  typical     large      huge      tiny 
 38.48451 113.09734 153.93804  12.56637 
round( areas, 1 )           # round to one decimal place
typical   large    huge    tiny 
   38.5   113.1   153.9    12.6 

Some functions summarize a vector to a single number:

length( radiuses )          # number of elements in the vector
[1] 4
sum( radiuses )             # sum of elements
[1] 18.5
mean(radiuses)              # mean of elements
[1] 4.625
median(radiuses)            # median of elements
[1] 4.75

Integer sequence

A simple regular sequence you can create with : (colon operator). Try:

positions <- 5:15
positions
 [1]  5  6  7  8  9 10 11 12 13 14 15

You can also use the function seq, which allows to specify the step:

spacedPositions <- seq(5,15,3)
spacedPositions
[1]  5  8 11 14

Character vector

A character vector, so a vector of any texts, you may also create with the combine function c. Both single quotes ('word') and double quotes ("word") are allowed (must be identical at the beginning and the end of a text):

actors <- c( "Jay", "Gloria", "Claire", "Phil", 'Mitchell', 'Cameron' )
actors
[1] "Jay"      "Gloria"   "Claire"   "Phil"     "Mitchell" "Cameron" 

For example, a character vector is used to store names of vector elements. Use names function to access them:

radiuses <- c(typical=3.5, large=6, huge=7, tiny=1+1)
names( radiuses )                                   # get the names
[1] "typical" "large"   "huge"    "tiny"   
names( radiuses ) <- c("TYP", "LRG", "HUG", "TNY")  # set some new names
radiuses
TYP LRG HUG TNY 
3.5 6.0 7.0 2.0 

Logical vector

In R the following words denote logical values: FALSE and TRUE. You may also create a logical vector with the combine function c, but typically, logical vectors are results of logical conditions:

temperatures <- c( 36.4, 35.8, 37.2, 38.9, 40.2, 37.1, 35.8, 34.8 )
temperatures > 36.6                    # the result is a logical vector
[1] FALSE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE

Such a logical vector can be used to filter elements of another vector:

temperatures[ temperatures > 36.6 ]    # greater than 36.6
[1] 37.2 38.9 40.2 37.1
temperatures[ temperatures <= 36.6 ]   # less or equal to 36.6
[1] 36.4 35.8 35.8 34.8
temperatures[ temperatures != 35.8 ]   # not equal to 35.8
[1] 36.4 37.2 38.9 40.2 37.1 34.8
sel <- temperatures == 35.8
temperatures[ sel ]                    # exactly equal to 35.8
[1] 35.8 35.8
temperatures[ !sel ]                   # NOT (exactly equal to 35.8), same as !=
[1] 36.4 37.2 38.9 40.2 37.1 34.8
sel <- temperatures > 36 & temperatures < 38
temperatures[ sel ]                    # larger than 36 AND less than 38
[1] 36.4 37.2 37.1
sel <- temperatures <= 35 | temperatures >= 40
sel                                    # less or equal to 35 OR greater or equal to 40
[1] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE

The function sum might be used on a logical vector to count how many elements are TRUE:

sum( temperatures > 36.6 )
[1] 4

Missing values

Use NA (not available) to represent a single missing value. Missing values can appear in vectors of any type. When a missing value appears in a calculation, the result is usually also a missing value:

radiuses <- c(3.5, 6, NA, 7, 1+1)
round( pi*radiuses^2, 1 )
[1]  38.5 113.1    NA 153.9  12.6

Descriptive statistics functions like mean, min, max, sum require the argument na.rm=TRUE to ignore missing values:

sum( radiuses, na.rm=TRUE )
[1] 18.5
mean( radiuses, na.rm=TRUE )
[1] 4.625

The function is.na returns a logical vector with TRUE at positions corresponding to NAs:

radiuses
[1] 3.5 6.0  NA 7.0 2.0
is.na( radiuses )
[1] FALSE FALSE  TRUE FALSE FALSE

When combined with the function sum, it counts the number of missing values in a vector:

sum( is.na( radiuses ) )
[1] 1

Or, when negated (with ! symbol) it counts the number of not missing values in a vector:

sum( !is.na( radiuses ) )
[1] 4

To select all non-missing elements do:

radiuses[ !is.na( radiuses ) ]
[1] 3.5 6.0 7.0 2.0

Vector elements

Use square brackets to select an element from a vector. The first element is at position 1. The following selects an element from the second position:

temperatures <- c( 36.4, 35.8, 37.2, 36.9, 36.1 )
temperatures[2]
[1] 35.8

The last element can be selected as follows:

temperatures[ length( temperatures ) ]
[1] 36.1

To select multiple elements at once:

temperatures[ c(3,1) ]   # third and first
[1] 37.2 36.4
positions <- 2:4
temperatures[positions]  # second, third, and fourth
[1] 35.8 37.2 36.9

Note, when a vector is printed, the number reported on the left in square brackets shows the position of the next vector element:

101:128
 [1] 101 102 103 104 105 106 107 108 109 110 111 112     # 101 is at [1st] pos
[13] 113 114 115 116 117 118 119 120 121 122 123 124     # 113 is at [113th] pos
[25] 125 126 127 128                                     # 125 is at [125th] pos

When a vector is named, the names can also be used in selection:

temperatures <- c( Day1=36.4, Day2=35.8, Day3=37.2, Day4=36.9, Day5=36.1 )
temperatures["Day3"]
Day3 
37.2 
selDays <- c( "Day4", "Day3" )
temperatures[ selDays ]
Day4 Day3 
36.9 37.2 

Structure and class

Use class to find what type of a vector is stored in a variable.
More information about the structure of a vector can be obtained with the function str.

positions <- 5:15
temperatures <- c( 36.4, 35.8, 37.2, 38.9, 40.2, 37.1, 35.8, 34.8 )
actors <- c( "Jay", "Gloria", "Claire", "Phil", 'Mitchell', 'Cameron' )
isFever <- temperatures > 36.6
class( temperatures )
[1] "numeric"
class( positions )
[1] "integer"
class( actors )
[1] "character"
class( isFever )
[1] "logical"
str( temperatures )
 num [1:8] 36.4 35.8 37.2 38.9 40.2 37.1 35.8 34.8
str( positions )
 int [1:11] 5 6 7 8 9 10 11 12 13 14 ...
str( actors )
 chr [1:6] "Jay" "Gloria" "Claire" "Phil" "Mitchell" "Cameron"
str( isFever )
 logi [1:8] FALSE FALSE TRUE TRUE TRUE TRUE ...

Estimate weights (exercise) 📖

Needs: “Vectors”.
Gives: sort.
Practices: c, round, sort.

Let’s assume that we want to estimate the weights of certain individuals. All we have is the following picture:

Let’s also assume that the individuals in the picture have healthy BMI and that we can roughly estimate their weights based on their heights. The heights in the picture are expressed in imperial units, e.g. 5’0’’ stands for 5 feet and 0 inches, where foot=12’’ and 1’’=2.54cm.

  1. Create a vector of heights of the individuals in metre based on your rough estimate of inches in the picture.
  2. Calculate the weights from the obtained heights by the equation \(Weight_{kg}=(Height_m-1)*100\).
  3. Calculate the mean of obtained weights rounded to 1 decimal.
  4. Include two additional weights 99 and 68 kg into your data.
  5. Sort the weights in decreasing order.
Solution…
heights_metre <- c(5*12+7,6*12+2,6*12+6,6*12+2,6*12)*2.54/100 # rough estimate in inches and conversion to meter
weights <- (heights_metre-1)*100                              # weight estimate
round( mean( weights ), 1 )                                   # mean weight rounded to 1 decimal
weights <- c(weights, 68, 99)                                 # adding additional weights
sort(weights, decreasing = TRUE)                              # sorting

Monthly spendings (exercise) 📖

Needs: “Vectors”.
Practices: c, names, “[…]”.

Below is the monthly spending of different categories in the first quarter:

spendJan <- c( transport = 50, rent = 500, sport = 33, food = 200, clothes = 100 )
spendFeb <- c( rent = 550, clothes = 30, food = 130, sport = 55, transport = 90 )
spendMar <- c( rent = 600, food = 180, transport = 70, clothes = 280, sport = 50 )

What is the total spending per category?

Solution…
nm <- names( spendJan )
spendJan[ nm ] + spendFeb[ nm ] + spendMar[ nm ]

Vectors, continuation (lecture) 👨‍🏫

Needs: “Vectors”, “:”, length, TRUE, FALSE, “RelOps”.
Gives: “strsplitExample”, “[…]”, sort, unique, “==”, “!=”, any, all, which, duplicated, “[logical]”, paste, paste0.

Splitting a text

Splitting a single text at a certain character (here :) into a vector of words can be achieved as follows.
(Note: this is a more advanced example, the strsplit, [[...]] and lists will be introduced later.)

# Here is a single text with many words separated by :
text <- "apple:banana:lemon:apple:raspberry:pear:apple:lemon" 

# More advanced code, which separated text at : into a vector of words
fruits <- strsplit( text, ":" )[[1]]                            

# Now, the fruits vector contains the words (a character vector)
fruits                                                         
[1] "apple"     "banana"    "lemon"     "apple"     "raspberry" "pear"     
[7] "apple"     "lemon"    
length( fruits )
[1] 8

Fruits from the third to the fifth position are:

fruits[ 3:5 ]
[1] "lemon"     "apple"     "raspberry"

Sorting

And here is the sorted list of fruits:

sort( fruits )
[1] "apple"     "apple"     "apple"     "banana"    "lemon"     "lemon"    
[7] "pear"      "raspberry"
sort( fruits, decreasing = TRUE )
[1] "raspberry" "pear"      "lemon"     "lemon"     "banana"    "apple"    
[7] "apple"     "apple"    

The sort function can also be used to sort numbers:

numbers <- c( 3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5 ) # some numbers
sort( numbers )
 [1] 1 1 2 3 3 4 5 5 5 6 9
sort( numbers, decreasing = TRUE )
 [1] 9 6 5 5 5 4 3 3 2 1 1

Unique/duplicated elements

Here are the unique fruits:

unique( fruits )
[1] "apple"     "banana"    "lemon"     "raspberry" "pear"     

This logical vector tells which elements has appeared before:

duplicated( fruits )
[1] FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE

These elements are at the following positions:

which( duplicated( fruits ) )
[1] 4 7 8

Similarly for numbers:

numbers
 [1] 3 1 4 1 5 9 2 6 5 3 5
sort( unique( numbers ) )      # unique numbers, sorted
[1] 1 2 3 4 5 6 9
duplicated( numbers )          # has this element appeared before: TRUE/FALSE
 [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE
which( duplicated( numbers ) ) # at which positions are duplicates
[1]  4  9 10 11

All and any

The following logical vector shows which fruits are apple (or not apple):

fruits
[1] "apple"     "banana"    "lemon"     "apple"     "raspberry" "pear"     
[7] "apple"     "lemon"    
fruits == "apple"            # is this element "apple": TRUE/FALSE
[1]  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE
fruits != "apple"            # is this element not "apple": TRUE/FALSE
[1] FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE

Are there any apple elements in the vector:

any( fruits == "apple" )     # is there any element "apple": a single TRUE or FALSE
[1] TRUE

Or, are all fruits not apple:

all( fruits != "apple" )     # are all elements not "apple": a single TRUE or FALSE
[1] FALSE

Some examples with numbers:

numbers
 [1] 3 1 4 1 5 9 2 6 5 3 5
numbers >= 5 & numbers < 7        # is this element greater/equal 5 and less than 7: TRUE/FALSE
 [1] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE  TRUE
any( numbers >= 5 & numbers < 7 ) # is there any element greater/equal 5 and less than 7, a single TRUE or FALSE
[1] TRUE

Selecting elements

What are the other fruits than apples (duplicates possible):

fruits[ fruits != "apple" ]
[1] "banana"    "lemon"     "raspberry" "pear"      "lemon"    
fruits[ !(fruits == "apple") ]
[1] "banana"    "lemon"     "raspberry" "pear"      "lemon"    

At which positions is the lemon:

which( fruits == "lemon" )    # not a logical vector, but integer positions
[1] 3 8

What are the numbers which are smaller than 5 or greater/equal 7:

numbers[ numbers < 5 | numbers >= 7 ]
[1] 3 1 4 1 9 2 3

Producing text by merging elements

For reporting there is often a need to merge elements into a single text:

uqFruits <- unique( fruits )                       # character vector, several elements
uqFruitsTxt <- paste( uqFruits, collapse = ", " )  # single text: fruit names separated by ,
txt <- paste0("Unique fruits: ", uqFruitsTxt, ".") # single text: a full sentence
txt
[1] "Unique fruits: apple, banana, lemon, raspberry, pear."

Another example with numbers:

paste0( sort( unique( numbers ) ), collapse="<" )
[1] "1<2<3<4<5<6<9"

Descriptive statistics (exercise) 📖

Needs: “Vectors”, sum, min, max, median, fivenum, head, sort, tail, NA, “RelOps”.
Practices: length, sum, max, min, median, fivenum, head, tail, sort, sum, NA, “RelOps”.

You are given a numerical vector heights representing the heights(cm) of a group of individuals. For some reasons some of these individuals have missing value (NA) for their heights.

heights <- c(180, 170,  NA,  NA, 174, 178, 169, 174,  NA, 174, 170, 174, 173,
  168, 170, 166,  NA, 187, 170, 164,  NA, 167, 178, 165, 164)
  • How many individuals are in the group?
  • How many individuals have their height not reported?
  • What is the height range (minimum and maximum)?
  • If the individuals are ordered by their height in a row, what is the height of the individual standing in the middle of the row? If no middle exists in case of odd number of reported heights then take the average of the two in the middle. Which function in R does exactly this?
  • Try fivenum(see ?fivenum) on heights vector and inspect the results.
  • Extract and print 5 known lowest and 5 known largest heights.
  • How many individuals are within the range (175,180), excluding both 175 and 180?
  • How many individuals are at most 175 cm or at least 180 cm tall?
  • Create a variable validHeights with only the valid heights (remove NA values).
Solution…
# How many individuals
length(heights)

# How many individuals have their height not reported
sum(is.na(heights))

# Height range
max(heights, na.rm=TRUE)
min(heights, na.rm = TRUE)
range( heights, na.rm=TRUE)

# Middle height
median(heights, na.rm=TRUE)

# Fivenum gives min/median/max and two quartiles (see ?fivenum)
fivenum(heights)

# 5 lowest and 5 largest heights
head(sort(heights), 5)
tail(sort(heights), 5)

# Number of individuals within the range (175,180), uses & (logical AND)
sum(heights > 175 & heights < 180, na.rm=TRUE) 

# Number of individuals at most 175 cm or at least 180 cm tall, uses | (logical OR)
sum(heights <= 175 | heights >= 180, na.rm=TRUE) 

# Valid heights, uses negation of is.na result
validHeights <- heights[!is.na(heights)]

Snacks in a kindergarden (exercise) 📖

Needs: “strsplitExample”, length, “[…]”, “:”, sort, unique, “==”, “!=”, any, all, which, duplicated.
Practices: length, “==”, any, unique, duplicated, sort, which.

In a kindergarten there are children, each of a different name. The children are eating snacks. Each child should get exactly one snack. Here, in nms, there are the names of the children in the order of picking snacks (every time a single snack was picked):

nmsTxt <- "Oliver Mia Walter Charlie Liam Kate Hank Gina Zoe Eva Xena Tom Nora Jack Cathy Helen Yuri Rita Ivan Ben Adam Daisy Liam Quinn George Sam Vera Pam Eve Bob Alice Frank David Fred Yuri Mia Yuri Ursula"
nms <- strsplit( nmsTxt, " " )[[1]]  # split the text into words, will be introduced later
  1. How many snacks were picked?
  2. How many kids picked a snack?
  3. Did Jane get a snack?
  4. Are there children who picked a snack more than once?
  5. What are the names of the children who picked a snack more than once (but don’t mention any child twice)?
  6. Can you produce an alphabetical list of children names, skipping duplicates?
  7. Assuming that Eva picked exactly one snack: how many snacks were picked before Eva did, and how many after?
  8. Fred also took just one snack: what are the unique names of the children who picked a snack after Fred?
Solution…
# num of snacks taken
length(nms) 

# num of kids
length(unique(nms)) 

# Jane got a snack?
any(nms == "Jane")

# more snacks picked than kids?
length(unique(nms)) < length(nms) 

# which kids took more than one snack?
unique(nms[duplicated(nms)])

# alphabetical list of kids
sort(unique(nms))

# how many kids before "Eva"?
which(nms == "Eva")-1

# how many kids after "Eva"?
length(nms) - which(nms == "Eva")

# kids after "Fred"
unique( nms[(which(nms == "Fred")+1):length(nms)] )

Sampling from the normal distribution (lecture) 👩‍🏫

Needs: “callingFunctions”, “console”, head, min, max.
Gives: rnorm, hist, quantile, median.

The function rnorm can be used to sample data from the normal distribution:

vs <- rnorm( 100 ) # sample 100 values from the normal distribution, mean=0, sd=1
head( vs, 10 )
 [1]  1.1118376  0.2168103 -0.9618970  1.3875415  0.7494128  1.9679026
 [7]  1.8170746 -0.0663194  1.3377349 -0.7345405

It is possible to specify the mean and standard deviation of the distribution:

vs <- rnorm( n=500, mean=10, sd=2 ) # 500 samples, mean=10, sd=2
head( vs, n=10 )
 [1] 11.867961 13.182737 10.090777 10.927213 11.205689  4.551860 10.571278
 [8] 10.785919  9.666804  5.550006

A simple histogram can be created with hist to visualize the distribution of the sampled data:

hist( vs )

The hist function has many arguments to customize the plot. For example (check ?hist for more details):

hist( vs, breaks=20, col="lightblue", main="Sampled with rnorm(500,10,2)" )

The quantile function can be used to calculate quantiles of the data (using many different methods). For example, to calculate the 0, 0.1, 0.25, 0.5, 0.75, 0.9 and 1 quantiles:

quantile( vs, c(0, 0.1, 0.25, 0.5, 0.75, 0.9, 1) )
       0%       10%       25%       50%       75%       90%      100% 
 4.356330  7.454251  8.697243 10.044125 11.281519 12.653580 16.355892 

Compare the calculated quantiles with:

min( vs )       # quantile 0
[1] 4.35633
median( vs )    # quantile 0.5
[1] 10.04412
max( vs )       # quantile 1
[1] 16.35589

Factors (lecture) 👩‍🏫

Needs: “Vectors”, NA, length, str, c, “[…]”, NA.
Gives: “Factors”, levels, nlevels, as.integer, table, names.

A factor is a vector of categorical values (values chosen from a set of predefined possibilities, called levels).

For example, a factor can be used to store hair color of some people. Let’s assume that we can classify each person to one of the following hair colors. This will be the levels of the factor:

c( "black", "brown", "ginger", "blond", "gray", "white" )

Now, let’s assume that for 10 people we observed the following hair colors.
NA (not available, missing value) stands for a person whose hair color could not be determined.
This will be the values of the factor:

c("brown","ginger","black","brown",NA,"brown","black","brown","blond","gray") 

To create the factor and store it in a variable hairColors, do:

hairColors <- factor(
  c("brown","ginger","black","brown",NA,"brown","black","brown","blond","gray"),
  levels = c( "black", "brown", "ginger", "blond", "gray", "white" )
)
hairColors
 [1] brown  ginger black  brown  <NA>   brown  black  brown  blond  gray  
Levels: black brown ginger blond gray white

Factor properties and elements

A short presentation of a factor can be obtained with the str function:

str( hairColors )
 Factor w/ 6 levels "black","brown",..: 2 3 1 2 NA 2 1 2 4 5

The length of the factor is the number of observations:

length( hairColors )
[1] 10

And elements of the factor can be accessed by the square bracket operator:

hairColors[ 3:6 ]
[1] black brown <NA>  brown
Levels: black brown ginger blond gray white

Elements of a factor can also have names. For example:

names( hairColors ) <- c( "Alice", "Bob", "Charlie", "Daisy", "Eve", 
  "Frank", "George", "Helen", "Ivan", "Jack" )
hairColors
  Alice     Bob Charlie   Daisy     Eve   Frank  George   Helen    Ivan    Jack 
  brown  ginger   black   brown    <NA>   brown   black   brown   blond    gray 
Levels: black brown ginger blond gray white

Naturally, the names can be used to access the elements:

hairColors[ "Alice" ] <- "ginger"   # this changes the hair color of Alice
hairColors
  Alice     Bob Charlie   Daisy     Eve   Frank  George   Helen    Ivan    Jack 
 ginger  ginger   black   brown    <NA>   brown   black   brown   blond    gray 
Levels: black brown ginger blond gray white

Levels and coding

In R each factor is stored as two vectors.
The first vector codes the levels with subsequent integers: 1=black, 2=brown, …
Type the following to get this vector:

levels( hairColors )
[1] "black"  "brown"  "ginger" "blond"  "gray"   "white" 

and to find the number of factor levels enter:

nlevels( hairColors )
[1] 6

The second vector keeps integers corresponding to the factor values.
Try the following and understand the relation of the integers and the levels:

as.integer( hairColors )
 [1]  3  3  1  2 NA  2  1  2  4  5

Count occurrences

Use the table function to count occurrences of each level of a factor:

table( hairColors )
hairColors
 black  brown ginger  blond   gray  white 
     2      3      2      1      1      0 
occs <- table( hairColors )
occs[ "brown" ]
brown 
    3 

A factor from raw data, data cleaning (exercise) 📖

Needs: “Vectors”, table, NA.
Gives: “Factors”, factor, addNA.
Practices: factor, addNA, c, NA, table.

Raw data

Let’s assume, that an expensive survey was conducted, and the participants were asked to fill in one of the following values: low, medium, high. Some participants made typos. The data needs to be cleaned, trying to preserve as much information as possible.

The following chunk of code produces a character vector vs with the values obtained from the survey.

vsTxt <- "low low low low medium L low medium medium low low low low low ? low looow low low high low low mediun medium low right low low low medium HIGH low low low low low low low Medium low low low low low high medium low low low medium low low low high low low medium low high medium medium low medium high L ?"
vs <- strsplit( vsTxt, " " )[[1]]  # split the text into words, will be introduced later
vs[10] <- NA
head( vs, 16 ) # the first 16 responses
 [1] "low"    "low"    "low"    "low"    "medium" "L"      "low"    "medium"
 [9] "medium" NA       "low"    "low"    "low"    "low"    "?"      "low"   

Building a factor

The goal is to create a factor fs from the corrected values of the vs vector:

  • Print, how many raw responses are there.
  • Create the factor with the levels specified in the "low", "medium", "high" order, and then add the NA level with the addNA function.
  • Finally, print the counts of each of the levels in the factor.

Modifications of the raw input are not allowed (don’t correct the words in vsTxt manually). Instead, create a named character vector m with names corresponding to the raw spelling, and values corresponding to the correct levels, e.g. m <- c( "HIGH"="high", ... ). Then, map the vs vector through m using the square bracket operator. When you can’t decide what should be the level, use NA.

Note: For a tidyverse solution, you could use fct_count to count the levels, fct_collapse to map the words to correct levels, fct_relevel to reorder the levels.

Solution…
### base-R variant
length( vs )
table( vs )
m <- c( "?"=NA, "high"="high", "HIGH"="high", "medium"="medium", "mediun"="medium", "Medium"="medium", "looow"="low", "low"="low", "L"="low" )
table( m[ vs ] )
fs <- factor( m[ vs ], levels = c( "low", "medium", "high" ) )
fs <- addNA( fs )
table( fs )

### tidyverse variant
fs <- factor( vs )
fct_count( fs )
fs <- fct_collapse( fs, low = c( "L", "low", "looow" ), medium = c( "medium", "Medium", "mediun" ), high = c( "HIGH", "high" ), other_level = NA )
fct_count( fs )
fs <- fct_relevel( fs, "low", "medium", "high" )
fct_count( fs )

Generate sample of female heights (exercise) 📖

Needs: “Vectors”, “RelOps”, round, rnorm, names, mean, sd, hist, sort, quantile.
Gives: sample.
Practices: rnorm, round, sample, names, mean, sd, hist, sort, head, quantile.

Let’s assume that you are expected to work with data describing heights of female students.
The real data is not available yet, so you need to generate a random sample.

Generate

Generate a named vector heights:

  • There should be studentsNum elements in the vector.
  • Values should be sampled from the normal distribution with a mean femaleMeanHeight and a standard deviation femaleSdHeight.
  • Names should be randomly sampled from the provided femaleNames vector. Find out how to use the sample function to perform this task.
studentsNum <- 30
femaleMeanHeight <- 170.7 # cm
femaleSdHeight <- 6.3     # cm
femaleNamesTxt <- "Anastasia,Melissa,Nora,Marlene,Isabel,Eleonora,Amelie,Katharina,Leonie,Sophie,Zoe,Vera,Julia,Magdalena,Elena,Karina,Serena,Viviana,Carmen,Ella,Anita,Ines,Beatrice,Susanne,Katja,Yvonne,Noemi,Miranda,Sonja,Clara,Valentina,Luisa,Victoria,Stefanie,Elisabeth,Martina,Sophia,Sara,Lea,Lorena,Patricia,Emma,Tanja,Nicole,Silvana,Martha,Selina,Melanie,Jolanda,Lucia,Isabella,Felicia,Giulia,Marina,Samantha,Jasmin,Lucie,Corina,Luna,Matilda,Jessica,Lena,Francesca,Alessia,Claudia,Lavinia,Nadia,Regina,Carolina,Adriana,Natalie,Nina,Sylvia,Livia,Miriam,Maria,Mia,Diana,Fiona,Patrizia,Helena,Judith,Laura,Gabriela,Liliana,Tara,Selena,Mona,Paula,Valeria,Verena,Tatjana,Ariana,Alina,Anna,Rebecca,Teresa,Marisa,Gloria,Luciana"
femaleNames <- strsplit( femaleNamesTxt, "," )[[1]]
Solution…
heights <- round( rnorm( n=30, mean=femaleMeanHeight, sd=femaleSdHeight ), 0 )
heights <- setNames( heights, sample( femaleNames, size=length(heights) ) )
heights

Study

Study the generated heights vector:

  • Find the mean and standard deviation of the heights. Are these values far from the parameters used to generate the heights?
  • Produce the histogram of the heights.
  • Find the names of students with heights above the sample mean.
  • Print the heights and the names of the 5 tallest students (sorted with decreasing heights).
  • What are the heights corresponding to the quantiles 0.3 and 0.7?
Solution…
hist(heights)
mean(heights)
sd(heights)

names( heights[ heights > mean(heights) ] )
head( sort(heights, decreasing = TRUE), 5 )
quantile( heights, c(0.3, 0.7) )

Data manipulation 1/3

The pipe operator (lecture) 👨‍🏫

Needs: rnorm, mean, round, sum.
Gives: “|>”, “%>%”.

We want to calculate the mean of 100 randomly generated values from the normal distribution and round it to two decimal places. A step by step solution in statements would look like this:

vs <- rnorm(n=100, mean=5, sd=2)
vsMean <- mean(vs)
round(vsMean, 2)

where two variables vs and vsMean were introduced. A more compact solution would be to call the functions in composition without introducing variables such vs and vsMean:

round( mean(rnorm(n = 100, mean = 5, sd = 2)), 2)

R version 4.1.0 introduced the pipe operator |> (earlier %>% in magrittr R package part of tidyverse). It allows to write the code in a “chained” manner. The pipe operator passes the result of the left-hand side expression as the first argument of the right-hand side function. The expression sum(1:10) is equivalent to 1:10 |> sum().

The pipe notation represents the “data flow” in the expression towards the end result, it also corresponds well to the order of the solution to the problem:

rnorm(n=100, mean=5, sd=2) |> mean() |> round(2)

The tidyverse library is carefully designed to work well with the pipe operator.Complex data manipulation tasks can be written in a compact and readable way.

Tibble (lecture) 👨‍🏫

Needs: “Vectors”.
Gives: tibble, “[[”, “[”.

In previous session we imported existing tables as tibble, a data type defined in tidyverse package. To create a tibble manually in R we use the function tibble:

d <- tibble(
    name = c("Gabriela", "Gale", "John", "Guy", "George"),  
    sex = c("female", "female", "male", "male", "male"),
    age = c(17, 17, 20, 18, 17))
d
# A tibble: 5 × 3
  name     sex      age
  <chr>    <chr>  <dbl>
1 Gabriela female    17
2 Gale     female    17
3 John     male      20
4 Guy      male      18
5 George   male      17

You can extract variables(columns) as vectors with [[ and $, or as a separate tibble with [.

d[["name"]] # <=> d$name : extracts 'name' as a vector
d["name"]   # extracts 'name' as tibble
d[c("name", "age")] 

Similar to lists, variables in a tibble may be updated with an assignment <-:

d["height"] <- rep(NA, nrow(d))        # add variable 'height' with missing values
d["height"] <- NA                      # equivalent (see recycling)
d["weight"] <- sample(60:80,nrow(d))   # add 'weight' variable with random values

With the pull function (tidyverse) we can extract a (named) vector. Recall that we can extract vectors from a tibble with [[ or $, however the advantage of the pull function is that it can be used in a pipe expression.

d[['weight']]                        # <=> d$weight
d |> pull( weight )                  # weight as a vector 
d |> pull(var = weight, name = name) # named vector ; Note the argument 
                                     # names (see ?pull) 

The pipe operator (exercise) 📖

Needs: “|>”, “%>%”, sort, tail, mean, round, rnorm, paste.
Practices: “|>”, “%>%”.

Solve the following two tasks using the non-pipe and the pipe versions of the code:

  • A vector with first names of some people is given, e.g. c( "Tom", "Jerry", "Mickey", "Donald" ). Sort the names in the reversed alphabetical order and concatenate them into a single string separated by commas (hint: see paste for character concatenation).

  • A vector of person heights is given, e.g. hs <- rnorm( n=100, mean=170, sd=10 ).
    Calculate the mean height of the 5 tallest persons. Round the result to one decimal place.

Solution…
# 1
c( "Tom", "Jerry", "Mickey", "Donald" ) |> sort(decreasing=TRUE) |> paste( collapse=", " )
# 2
hs <- rnorm( n=100, mean=170, sd=10 )
hs |> sort() |> tail(5) |> mean() |> round(1)

Tibble (exercise) 📖

Needs: tibble, mean, “RelOps”, “|>”.
Practices: tibble, mean, “RelOps”, “|>”.

  1. Create a tibble exercise_group for a group of individuals with names {Sonja, Steven, Ines, Robert, Tim} with their height {164, 188, 164, 180, 170}, weight {56.0, 87.0, 54.0, 80.0, 58.5} and frequency of exercise {high, high, low, moderate, low}.

  2. Add a new numeric variable sex (1=‘male’, 2=‘female’) to exercise_group tibble.

  3. Calculate the mean height for males and females.

  4. Add a new candidate Eve to exercise_group with height=170, weight unknown and exercise level high.

  5. Calculate the mean weight. Use the pipe operator and thereby avoid creating intermediate variables.

  6. Create a named vector of weights of the individuals with high exercise level.

Solution…
# 1)
exercise_group <- tibble(name=c("Sonja" , "Steven", "Ines", "Robert", "Tim" ),
                         height=c(167, 185, 164, 180, 172),  
                         weight=c(56.0, 87.0, 54.0, 80.0, 58.5),
                         exercise=c("high", "high", "low", "moderate", "low")
                  )  
# 2)
exercise_group[['sex']] <- c(2,1,2,1,1)

# 3)
#
# Base R solution is used here, try this with 'filter' which will 
# be explained later.
#
heights <- exercise_group[["height"]]  
females <- exercise_group[["sex"]]==2
males <- exercise_group[["sex"]]==1
mean(heights[males])
mean(heights[females])

# 4)
exercise_group <- tibble(name = c( exercise_group$name, "Eve"),
       height = c( exercise_group$height, 170),
       weight = c( exercise_group$weight, NA),
       exercise = c( exercise_group$exercise, 'high'),
       sex = c( exercise_group$sex, 2)
       )

# 5) 
exercise_group |> pull(weight) |> mean(na.rm=TRUE) 

# 6)
weight_levels <- exercise_group |> pull(weight, name)
weight_levels[ (exercise_group |> pull(exercise, name)) == "high"]

Select: choosing columns (lecture) 👩‍🏫

Files: pulseNA.csv.
Needs: “pulseNA.csv”, read_csv, head, tail, “tidyverse”, “|>”, “%>%”.
Gives: select, rename.

The select function (from tidyverse) is used to select variables(columns) of a table. Given a table the select can be used to:

  • take a collection of variables and simultaneously reorder and/or rename the variables
  • reshuffle the variables order

You’ll need to load tidyverse library to be able to use select, also import the dataset pulseNA.csv into the variable d:

library(tidyverse)
d <- read_csv("pulseNA.csv", show_col_types=FALSE)

Execute the following code and observe the results (note, the d table is not modified as there are no assignments to d):

d                                                   # all cols, all rows
d |> head()                                         # top rows (default 6)
d |> select( weight, height )                       # 2 columns, all rows
d |> select( gender, weight, height ) |> head(n=3)  # 3 columns, 3 top rows
d |> 
  select( gender, weight, height, everything() ) |> 
  head(n=3)                                         # top 3 rows of reordered columns 
                                                    # Note: |> multi-line statement 
d |> select( -id, -name, -year ) |> head()          # no id, name, year columns
d |> tail()                                         # last rows (default 6)
d |> select( WEIGHT=weight, height ) |> tail()      # 2 columns (one renamed)

d |> rename( WEIGHT=weight ) |> tail()              # all columns, but one renamed

Note, the lines above do not modify the d table. If you want to store a modified table, you need to assign the result to a variable:

smallD <- d |> select( weight, height )
smallD |> head()

Filter: choosing rows (lecture) 👨‍🏫

Files: pulseNA.csv.
Needs: “pulseNA.csv”, “|>”, head, tail, “tidyverse”, is.na.
Gives: filter, “%in%”, “RelOps”.

The filter function is used to select rows of a table that meet certain conditions.

Manually run the pieces of code given below. Compare the output of the following lines of code to each other. Understand how you filter rows based on different conditions. Observe, how to build longer pipes by combining filtering with sorting (or with calculating the number of rows).

Load the pulseNA.csv file into the d variable. The filter function comes from the tidyverse library, so you need to load it with library(tidyverse).

library(tidyverse)
d <- read_csv( "pulseNA.csv" )

d |> filter( weight == 90 )
d |> filter( weight > 90 ) |> arrange( weight )  # arrange sorts by weight column
d |> filter( weight >= 90 ) |> arrange( weight ) # (low weights at the top)

Multiple conditions Logical conditions may be combined using and (‘&’) and or (‘|’) operators where:

  • A & B : both conditions A and B must be TRUE
  • A | B : at least one of A or B must be TRUE
d |> filter( weight < 50 | weight > 100 )             # OR
d |> filter( weight >= 80 & weight < 90 )             # 80 yes, 90 no (AND)
d |> filter( between( weight, 80, 90 ) )              # 80 yes, 90 yes
d |> filter( weight >= 80, weight < 90 )              # 80 yes, 90 no (comma and AND)
d |> filter( weight >= 80 ) |> filter( weight < 90 )  # 80 yes, 90 no (two steps are AND)

Negation operator (‘!’) takes the complement of the logical expression, i.e. if the outcome of logical expression A is TRUE then !A is FALSE and vice versa.

‘%in%’ operator Tests whether the elements of a vector are present in another and returns a logical vector of the size of the left operand.

d |> filter( gender == "male" )
d |> filter( !( gender == "male" ) )
d |> filter( gender != "male" )        # equivalent to the previous line
d |> filter( weight < 50 | weight > 100 ) |> filter( gender != "male" )
d |> filter( gender == "male" ) |> nrow()

d |> filter( exercise == "low" )
d |> filter( exercise == "low" | exercise == "moderate" )
d |> filter( exercise %in% c( "low", "moderate" ) )

selExercises <- c( "low", "moderate" )
d |> filter( exercise %in% selExercises )

d |> filter( !is.na( exercise ) )      # remove rows where exercise is missing
d |> filter( !is.na( exercise ) | !is.na( pulse2 ) )   # remove rows where
                                       # exercise or pulse2 is missing

Arrange: sorting rows (lecture) 👩‍🏫

Files: pulseNA.csv.
Needs: “pulseNA.csv”, 🔴“LcPipeOperator”, 🔴“LcSelect”, head, tail, 🔴tidyverse.
Gives: arrange.

The arrange function (tidyverse) is used to sort rows of a table based on the values in one or more columns.

As it was the case for select and filter you’ll need to load tidyverse library to be able to use arrange, also import the dataset pulseNA.csv:

library(tidyverse)
pulse <- read_csv( "pulseNA.csv" )

Run the code below. Compare the output of the following lines of code to each other. Understand how you sort rows of the table and print table parts.

pulse |> arrange( height ) |> head()
pulse |> arrange( desc( height ) ) |> head()
pulse |> arrange( height ) |> tail()
pulse |> arrange( height ) |> slice(3:8)

Also compare the output between each of these longer table manipulation “pipe” expressions:

pulse |> 
    select( name, age, height, weight ) |> 
    arrange( age ) |> 
    head(10)

pulse |> 
    select( name, age, height, weight ) |> 
    arrange( age, weight ) |> 
    head(10)

pulse |> 
    select( name, age, height, weight ) |> 
    arrange( age, desc(weight) ) |> 
    head(10)

The tidyverse library is designed to make the code more readable and easier to understand. For example, the last expression given above is equivalent to the following base-R code:

head( d[
    order( d$age, d$weight, decreasing = c( FALSE, TRUE ) ), 
    c( "name", "age", "height", "weight" ) 
], 10 )

Filter (exercise) 📖

Files: pulseNA.csv.
Needs: “pulseNA.csv”, “|>”, “%>%”, head, tail, “tidyverse”, filter, “%in%”, “RelOps”.
Practices: filter, “%in%”, “RelOps”.

Import pulseNA.csv into variable d:

  1. Filter the rows of the d table to select only the rows where the age is one of: 18 or 21. Propose two ways to do this.

  2. Filter the rows of the d table with weight more than 60 but not more than 70.

  3. Filter the rows of the d table to select only the rows where there is missing data on exercise and the participant was running.

  4. Filter the rows of the d table to select only the rows where the exercise is not missing and the participant is drinking alcohol.

Solution…
# 1)
d |> filter( age == 18 | age == 21 )
d |> filter( age %in% c( 18, 21 ) )

# pre-selected set of ages in 'selAges'  
selAges <- c( 18, 21 ) 
d |> filter( age %in% selAges )

# 2)
d |> filter( weight > 60 & weight <= 70 )

# 3)
d |> filter( is.na( exercise ), ran == TRUE )

# 4)
d |> filter( !is.na( exercise ), alcohol == "yes" )

Mutate: modifying the table (lecture) 👩‍🏫

Files: pulseNA.csv.
Needs: 🔴“surveyNA.csv”, 🔴“LcPipeOperator”, 🔴“LcSelect”, 🔴tidyverse.
Gives: if_else, mutate.

The mutate function is used to add or modify (new) variables(columns) in the table. It can use the values of existing variables to calculate the values of new ones. This operation does not change the number of rows in the table.

pulse <- read_csv( "pulseNA.csv" , show_col_types = FALSE)

pulse |> 
  select( height ) |> 
  mutate( isVeryTall = height > 190 ) |> 
  head()

pulse |> 
  select( height ) |> 
  mutate( howTall = if_else( height > 190, "very", "not_very" ) ) |> 
  head()

pulse |> 
    select( pulse1, pulse2 ) |> 
    mutate( pulseDiff = pulse2 - pulse1 ) |> 
    head()

pulse |> 
    select( pulse1, pulse2 ) |> 
    mutate( minPulse = pmin(pulse1, pulse2), maxPulse = pmax(pulse1, pulse2) ) |> 
    head()

pulse |> 
    select( pulse1, pulse2 ) |> 
    mutate( 
        minPulse = if_else( pulse1 < pulse2, pulse1, pulse2 ), 
        maxPulse = if_else( pulse1 < pulse2, pulse2, pulse1 ) ) |> 
    head()

Mutate: modifying the table (exercise) 📖

Files: pulseNA.csv, survey.csv.
Needs: 🔴“LcMutate”, “pulseNA.csv”, 🔴“survey.csv”, if_else, 🔴“LcMutate”, 🔴tidyverse.
Gives: case_when, geom_line.

  1. Take the following vectors weight (pound) and height (inch) and store them as variables(columns) of a tibble. Add a column BMI to this tibble where \(BMI=weight_{(kg)}/height_{(m)}^2\), and round the BMI values to one decimal place. Note that \(1(in) = 0.0254(m)\) and \(1(lb) = 0.4535923(kg)\).
weight <- c( 113, 137, 153, 142, 133, 123, 131, 136 ) # unit: pound
height <-  c( 66, 72, 69, 68, 68, 69, 70, 70 )        # unit: inch
  1. The following table is one of many BMI classifications:
classification bmi
1 underweight <18.5
2 normal 18.5-24.9
3 overweight 25-29.9
4 obese >=30

Add the variable BMI_class to the pulse tibble based on the classification in the table above. Use the function case_when as a better alternative for if_else.

Solution…
# 1
d <- tibble(
  weight = c( 113, 137, 153, 142, 133, 123, 131, 136 ), # unit: pound
  height = c( 66, 72, 69, 68, 68, 69, 70, 70 )          # unit: inch
)

d |> 
  mutate(weight_kg=weight*0.4535923, height_m=height * 0.0254) |>
  mutate(BMI=round(weight_kg/(height_m)^2, 1) ) |> 
  select(weight, height, BMI)

# 2, bad solution, think about BMI set to 24.95
pulse<- read_csv("pulseNA.csv", show_col_types = FALSE)

pulse |> 
  mutate(BMI=round( weight/((height/100)^2), digits = 1)) |> 
  mutate(BMI_class = case_when(
                BMI < 18.5 ~ "underweight",
                BMI >= 18.5 & BMI <= 24.9 ~ "normal",
                BMI >= 25   & BMI <= 29.9 ~ "overweight",
                BMI >= 30 ~ "obese"
              )                     
      )

# 2, better solution
# case_when checks conditions in sequence, the second one is checked only if the first is FALSE,
# therefore in the second condition it is not necessary to check that the first was FALSE.

pulse |> 
  mutate(BMI=round( weight/((height/100)^2), digits = 1)) |> 
  mutate(BMI_class = case_when(
                BMI < 18.5 ~ "underweight",
                BMI < 25 ~ "normal",           # here we know already that BMI >= 18.5
                BMI < 30 ~ "overweight",       # here we know already that BMI >= 25
                TRUE ~ "obese"                 # TRUE means: all other cases
              )                     
      )

Select & Filter: Survey dataset (exercise) 📖

Files: survey.csv, survey.csv.gz.
Needs: “|>”, “%>%”, head, tail, “tidyverse”, filter, nrow, select, is.na, range, “|>”, “%>%”, “%in%”, “RelOps”.
Gives: “survey.csv”.
Practices: filter, nrow, select, is.na, range, “|>”, “%>%”, “%in%”, “RelOps”.

Survey dataset

The survey data set is available as comma-delimited (.csv) file: survey.csv (or additionally compressed as survey.csv.gz). This data frame contains the responses of 233 Statistics I students at the University of Adelaide to a number of questions. Here we use a slightly modified version of the survey data from the R MASS pacakge.

Variable Explanation
name Name of a participant
gender Sex (male/female)
span1 Span (distance from tip of thumb to tip of little finger of spread hand) of writing hand (cm)
span2 Span of non-writing hand (cm)
hand Writing hand of student (left/right)
fold Fold your arms! which is on top? (right/left/neither)
pulse Pulse measurement (rate per minute)
clap Clap your hands! which is on top? (right/left/neither)
exercise Frequency of exercise (freq/some/none)
smokes How much the student smokes (heavy/regul/occas/never)
height Height (cm)
m.i whether the student expressed height in imperial (feet/inches) or metric (centimetres/metres) units. (metric/imperial)
age Age of the student (years)

Queries

  1. Select teenagers, assume age range between and including 10 and 19.

  2. Select all females with pulse equal to 60.

  3. Select all male teenagers with pulse above 60.

  4. How many males do smoke and never exercise?

  5. How many females never smoke and frequently exercise?

  6. Produce the following tibbles:

    6.1 Personal information {Name,Age,Gender,Height} of all teenagers.

    6.2 Personal information of males with height between and inclusive 170 to 180.

  7. Has survey data missing pulse values? If so how many? What about age?

  8. What is the percentage of males who never smoke and frequently exercise? Do the same for female.

  9. What is the age range in teenagers? You may use the range function (?range).

  10. How many males do smoke and never exercise? Use ‘%in%’ operator.

  11. Add the variable age_group to survey based on the following age classification:

classification age notation
1 adult >19 (19,∞)
2 adolescent >=10 and <=19 [10,19]
3 child >=1 and <=9 [1,9]
4 infant <1 (-∞,1)
Solution…
survey <- read_csv("survey.csv", show_col_types = FALSE)
# 1) 
survey |> filter(10<=age & age<20)


# 2)
survey |> filter(pulse==60 & gender=="female")

# 3)
survey |> filter(pulse>60 & gender=="male" & (10<=age & age<20) )

# 4)
#
# The conditions are 'do smoke' and 'never exercise'. 
# With 'do smoke' we have the categories {regul,occas,heavy} therefore 'or' (|) 
# logical connective is in place. For 'never exercise' we have the category 'none'. 
# Since both conditions must be true we will have the 'and' (&) logical connective.

survey |> 
  filter((smokes=="regul" | smokes=="occas" | smokes=="heavy") & exercise=="none" & gender=="male" ) |> 
  nrow()
# Alternatively, a shorter condition for 'smokes' is smokes!="never". It means accept 
# all values for 'smokes' as long as it is not equal to "never" and those are 
# exactly {regul,occas,heavy}.

survey |> filter(smokes!="never" & exercise=="none" & gender=="male" ) |> nrow()

# 5)
survey |> filter(smokes=="never" & exercise=="freq" & gender=="female") |> nrow()

# 6)
    ### 6,1
    survey |> select(Name=name, Age=age, Gender=gender, Height=height) |> 
      filter(10<=Age & Age<20)
    
    ### 6.2
    survey |> select(Name=name, Age=age, Gender=gender, Height=height) |> 
      filter(170>=Height & Height<=180 & Gender=="male")

# 7)    
survey |> filter(is.na(pulse)) |>  nrow()  
survey |> filter(is.na(age)) |>  nrow()  
    

# 8) Extra exercise
# men
none_smoker_sportsmen <-  survey |> 
  filter(smokes=="never"  & exercise=="freq" & gender=="male")  |>  
  nrow()
total_men <- survey |> filter(gender=="male") |>  nrow() 
none_smoker_sportsmen / total_men

# women : replace the condition gender=="male" into gender=="female"

# 9)
survey |> filter(10<=age & age<20) |> pull(age) |>  range()

# 10)
survey |> filter( (smokes %in% c("regul","occas", "heavy")) & exercise=="none" & gender=="male" ) |> nrow()


# 11)
mutate(survey, age_group = case_when(
                age > 19 ~ "adult",
                age >= 10 & age <= 19 ~ "adolescent",
                age > 1   & age <= 9 ~ "child",
                age <= 1 ~ "infant"
              )                     
      )

Data manipulation 2/3

Data Read/Write (lecture) 👩‍🏫

Common Data File formats

  • .rds — R’s single-object data format [saveRDS(), readRDS()].
  • .rda or .RData — R’s workspace data format, saving multiple R objects [save(), load()].
  • .csv — Comma-separated values, a common format for tabular data. [read_csv(), write_csv()]
  • .tsv or .txt — Tab-separated or plain text files, often used for data files. [read_tsv(), write_tsv()]
  • .xlsx or .xls — Excel spreadsheet files (supported with readxl or openxlsx packages).
  • .json — JSON format, often used for web data (supported with jsonlite package).
  • .xml — XML format, used for hierarchical data (supported with XML or xml2 packages).
  • .sav, .sas7bdat, .dta — Proprietary formats from SPSS, SAS, and Stata (haven or foreign packages).
  • .yaml - Data serialisation language (yaml package) (yaml.org, pecification).

In R, data types vary in the level of meta-information they can hold, which refers to the ability to store additional descriptive data about the dataset beyond the raw values.

  1. Vectors: The most basic structure, they hold atomic data (numeric, character, etc.) with minimal metadata like length and type.

  2. Matrices and Arrays: Extend vectors by organizing data into two or more dimensions. They allow metadata like dimension names (rownames, colnames, dim etc.) but are still limited in complexity.

  3. Data Frames: A more flexible format, storing heterogeneous data (different types in columns) with richer metadata like column names, class attributes for specific columns (e.g., factors), additional metadata about printing behavior.

  4. Lists: Highly flexible, they can store any type of R object, including additional metadata and nested structures. Less flexible for tabular analysis.

  5. Specialized Objects: Formats like S3, S4, and R6 objects can include extensive metadata via attributes, enabling complex data structures with custom methods.

The higher the level of abstraction (like tibbles or specialized objects), the more meta-information the format can hold.

WARNING: writing and reading between different data file formats may lead to loss of meta information.

Data Read/Write (exercise) 📖

Data Read/Write

  1. Data format In the survival package use the small dataset aml and export it to the following data formats given Filename and Functions. Functions are available via the tidyverse and haven package.
Filename     Storage     Format  Functions           Package
aml.csv      ASCII Text  CSV     read_csv/write_csv  tidyverse
aml.dta      binary      STATA   read_dta/write_dta  haven
aml.rds      binary      R       read_rds/write_rds  tidyverse  (<=> readRDS/saveRDS)
aml.sas7bdat binary      SAS     read_xpt/write_xpt  haven
aml.sav      binary      SPSS    read_sav/write_sav  haven
aml.tsv      ASCII Text  TSV     read_tsv/write_tsv  tidyverse
  1. Prepare SPSS data in R The SPSS data format can:
  1. hold labels on each variable as a short description of the variable.
  2. label the values of a variable in case of categorical variables.

Labeling a variable (column) is achieved by assigning the attribute label to it with a value, with the value being the description of the column (see ?attr). The labels of a categorical variable are defined as a key/value pair (see ?haven::labelled).

Take the survival package table aml and create and SPSS equivalent object in R, where:

  • variables have labels as described in the aml help page (?survival::aml)
  • variable aml$x has values 1 and 2 for its levels Maintained and Nonmaintained respectively

Finally export the created SPSS object to a file, say aml.sav and read it back into R and check whether the labels are maintained.

  1. Excel sheets Read an excel sheet file of your own data using the function exported by readxl R package, namely read_xls() or read_xlsx(). You may need to install readxl R package.
Solution…
# 1)
# 
library(haven)
library(survival)  # aml dataset
# export
aml |> write_sav("aml.sav")      # SPSS 
aml |> write_csv("aml.csv")      # CSV
aml |> write_tsv("aml.tsv")      # TSV
aml |> write_dta("aml.dta")      # Stata
aml |> write_xpt("aml.sas7bdat") # SAS  
aml |> write_rds("aml.rds")      # Natvie R
# import
aml_sav <-read_sav("aml.sav") 
aml_tsv <-read_tsv("aml.tsv", show_col_types = FALSE) 
aml_csv <-read_csv("aml.csv", show_col_types = FALSE) 
aml_dta <-read_dta("aml.dta") 
aml_sas <-read_xpt("aml.sas7bdat") 
aml_rds <-read_rds("aml.rds") 

# The variables time and status are vectors whereas variable x is a factor with meta-information
# Inspect and compare the original aml$x  class  with all other imports.
#
# solution 1.1) 
class(aml$x)
class(aml_sav$x)
class(aml_dta$x)
# etc.


# solution 1.2) Alternatively: use of map function 
# 
# Object names (objs) are checked for their type (class) one by one
# using map function. The map function will be explained in later 
# lectures. See also ?base::get and ?setNames.
objs <- paste("aml", c("sav", "tsv", "csv", "dta", "sas", "rds"), sep="_") 
map(setNames(objs,objs), ~ class(get(.x)$x))   

# 2)
# 
# SPSS variables and variable labels can be labelled. It is possible, using 
# haven R package, to create these labels manually so they can be exported 
# to a file for usage in the SPSS environment if needed. 
#
# The variables are labelled using the R attributes assignments:
attr(x = aml$time,which = "label")  <- "survival or censoring time"
attr(x = aml$status,which = "label")  <- "censoring status"
attr(x = aml$x,which = "label")  <- "maintenance chemotherapy status"
# Values of a variable can be labelled with haven::labelled function:   
aml[["x"]] <- labelled(aml[["x"]], c(Maintained=1, Nonmaintained=2)) 
write_sav(aml,"aml.sav")
aml_sav <- read_sav("aml.sav")


# 3) 
#
# Read one of your own excel sheets using `readxl` R package functions, 
# e.g. read_xlsx. If you do not have an excel sheet then it is a good 
# practice to create an excel sheet manually and try to read it into R 
# environment.
#

Descriptive statistics: (group) summaries (lecture) 👨‍🏫

Files: pulseNA.csv.
Needs: “pulseNA.csv”, “survey.csv”, 🔴tidyverse.
Gives: drop_na.

Often we are interested count, mean, sum, variance, standard deviation etc. of a variable. In the context of tidyverse this is achieved by the function summarise.

pulse |> summarise(mean_height=mean(height), sd_height=sd(height))
pulse |> summarise(mean_height=mean(height), sd_height=sd(height), size=n()) 
pulse |> summarise(mean_height=mean(pulse1, na.rm=TRUE), size=n()) 

Prior to summarise you may be interested in a summary based on groups in the dataset. The groups may be constructed from one or more categorical variables. This is acheived by group_by function.

# group/summarise :  mean age per exercise group
#
pulse |>  
  group_by(exercise) |> 
  summarise(mean_age=mean(age))

# handling missing values : is.na/drop_na
#
pulse |> 
  filter(!is.na(exercise)) |> 
  group_by(exercise) |> 
  summarise(mean_age=mean(age, na.rm=TRUE))

# drop all observations with missing value(s)
#
pulse |> 
  drop_na() |>
  nrow()

# drop observations with missing gender value only
#
pulse |> 
  drop_na(gender) |> 
  nrow() 

# group/summarise :  mean age per exercise group with missing values removed
#
pulse |> 
  drop_na() |> 
  group_by(exercise) |>  
  summarise(mean_age=mean(age))

# summarise vs mutate ; note that summarise drops the group variable 
# but mutate does not, to drop groups use ungroup()
#
pulse |> group_by(gender) |> summarise(size=n())
pulse |> group_by(gender) |> mutate(size=n())
pulse |> group_by(gender) |> mutate(size=n()) |>  ungroup()

# group_by :  multiple variables 
#
# Note the message: 'summarise()` has grouped output by 'gender'. 
# You can override using the `.groups` argument.'
#
pulse |> drop_na() |> 
  group_by(gender, exercise) |> 
  summarise(mean_age=mean(weight))

# silent the message on 'summarise' with multiple variables with the 
# argument .groups='drop'
pulse |> 
  drop_na() |> 
  group_by(gender, exercise) |> 
  summarise(mean_age=mean(weight), .groups = 'drop')

# counts, frequency tables
pulse |> group_by(gender) |>  summarise(size=n())
pulse |> count(gender)
pulse |> count(age,gender)

# group/mutate 
# 
pulse |> 
  count(gender, exercise) |> 
  group_by(gender) |>  
  mutate(proportion_within_gender=n/sum(n)) |> 
  mutate(percentage_within_gender=round(proportion_within_gender*100,1)) |> 
  arrange(gender, desc(percentage_within_gender)) |> 
  ungroup()

Descriptive statistics: (group) summaries (exercise) 📖

Files: pulseNA.csv.
Needs: 🔴“LcSummarise”, “pulseNA.csv”, “survey.csv”, 🔴“LcMutate”, 🔴tidyverse.
Gives: slice_head.

  1. Per gender, calculate the mean and the standard deviation of the pulse before the exercise. Find how to perform these calculations with ignoring missing values. Name the columns meanPulseBefore and sdPulseBefore.

  2. How many students were there in each year of the experiment?

  3. Per year, calculate the number of students and the number of missing values in the exercise column. Provide the results in a single table with columns year, studentsNum, missingExerciseNum.

  4. For each gender and run levels, build a table with min, median, and max of known pulses after the exercise.

  5. Give a summary of number of missing age values per exercise group.

  6. From the pulse dataset retrieve the top 3 tallest individuals in males and in females. Has the final groups defined then remove the grouping.

Solution…
# 1
d |> 
    group_by( gender ) |> 
    summarize( meanPulseBefore=mean(pulse1, na.rm=TRUE), sdPulseBefore=sd(pulse1, na.rm=TRUE) )

# 1 : alternative solution
d |>                               
    filter( !is.na(pulse1) ) |>
    group_by( gender ) |>
    summarize( meanPulseBefore=mean(pulse1), sdPulseBefore=sd(pulse1) )

# 2
d |> count( year )

# 3
d |> 
    group_by( year ) |> 
    summarize( studentsNum=n(), missingExerciseNum=sum( is.na(exercise) ) )

# 4
d |>
    filter( !is.na(pulse2) ) |>
    group_by( gender, ran ) |>
    summarize( minPulse=min(pulse2), medianPulse=median(pulse2), maxPulse=max(pulse2) )

# 5
pulse |> group_by(exercise) |>  summarise(missings=sum(is.na(age)))

# 6
pulse |> drop_na(gender, exercise) |> # remove rows with NA in gender and exercise
  group_by(gender) |> arrange(height) |>  slice_head(n=3)  |> 
  arrange(exercise) |> 
  ungroup()

Reshape tables (lecture) 👨‍🏫

Files: pulseNA.csv, survey.csv.
Needs: “pulseNA.csv”, “survey.csv”, 🔴tidyverse.

Two main functions to manipulate the layout of the table, pivot_longer transforms the table from wide to long format and pivot_wider does the opposite, converts the table from long to wide format.

pivot_longer function allows collapsing ‘similar’ variables into one variable while guaranteeing the data set’s consistency. For example take the variables pulse1 and pulse2 in the following subset of the pulse data set:

pulses <- pulse  %>%  
      select(name,pulse1,pulse2)  %>% 
      head(3) 
pulses
# A tibble: 3 × 3
  name     pulse1 pulse2
  <chr>     <dbl>  <dbl>
1 Bonnie       86     88
2 Melanie      82    150
3 Consuelo     96    176

We can transform the table as such that all pulse values are under a single variable, say pulse:

# 'pulse1' and 'pulse2' will become categories in 'pulse' variable
dfLong <- pulses %>% 
            pivot_longer(c(pulse1, pulse2), names_to = "pulse", values_to = "level")
dfLong
# A tibble: 6 × 3
  name     pulse  level
  <chr>    <chr>  <dbl>
1 Bonnie   pulse1    86
2 Bonnie   pulse2    88
3 Melanie  pulse1    82
4 Melanie  pulse2   150
5 Consuelo pulse1    96
6 Consuelo pulse2   176

pivot_wider function does the opposite of pivot_longer, going from long to wide format:

dfWide <- dfLong %>% pivot_wider(names_from = "pulse", values_from = "level")
dfWide  # is identical to the original 'pulses' tibble

Plotting

Often reshaping of the data is needed for visualisation. For example we would like to compare pulse1 and pulse2 variables with a boxplot. You may plot pulse1 and pulse2 in separate plots one after the other:

# For illustration we will use the R package gridExtra for arranging plots on screen.
#
suppressMessages(library(gridExtra))

# pulse1 boxplot
bp1 <- ggplot(pulse |>  drop_na()) +  # remove missing
  aes(y=pulse1) +
  geom_boxplot()

# pulse2 boxplot
bp2 <- ggplot(pulse |>  drop_na()) +  # remove missing
  aes(y=pulse2) +
  geom_boxplot()

grid.arrange(bp1, bp2, ncol=2)

As you can see the y-scale of each plot is set dynamically and can be misleading. To resolve this, we would need a single plot with aesthetic mapping of x being the categorical variable pulse and y the pulse values as was shown above.

ggplot(pulse |>  
         drop_na()  |>   # remove missing 
         pivot_longer( c(pulse1,pulse2), names_to = "pulse", values_to = "level")
       ) +  
  aes(x=pulse, y=level) +
  geom_boxplot()

Cross tables

Another situation where pivoting plays a role is with cross-tabulation. Cross-tables are used to display the distribution of two or more categorical variables in a compact way. Take the following count table of age,gender groups:

age_sex_group_counts  <- tibble(sex = c("female", "female", "female", "male", "male", "male", "male"),
       age = c(18, 19, 20, 18, 19, 21, 22), 
       count = c(1, 1, 2, 1, 2, 2, 1) )
age_sex_group_counts
# A tibble: 7 × 3
  sex      age count
  <chr>  <dbl> <dbl>
1 female    18     1
2 female    19     1
3 female    20     2
4 male      18     1
5 male      19     2
6 male      21     2
7 male      22     1

Here for each age group there are two entries, one for male and one for female, when there was count data available. A better way to represent the data is shown below where sex categories are set as separate columns:

age_sex_group_counts |>  pivot_wider(names_from = sex, values_from = count)
# A tibble: 5 × 3
    age female  male
  <dbl>  <dbl> <dbl>
1    18      1     1
2    19      1     2
3    20      2    NA
4    21     NA     2
5    22     NA     1

Reshape tables (exercise) 📖

Files: pulseNA.csv, survey.csv.
Needs: 🔴“LcSummarise”, “survey.csv”, “pulseNA.csv”, 🔴“LcMutate”, 🔴tidyverse.

  1. With the count function, part of tidyverse, you can calculate the category frequencies in a variable or multiples variables, see ?count for details. In the survey dataset produce the following cross-tables:
  1. gender as rows and exercise as columns
  2. exercise as rows and gender as columns
  3. smokes as rows and exercise as columns
  4. Can you find other intersting combinations?
  1. Produce the following count summary per gender from variables fold and clap combined in the survey dataset.
# A tibble: 2 × 4
  gender  left neither right
  <chr>  <int>   <int> <int>
1 female    67      30   137
2 male      66      37   129
  1. From pulse dataset reproduce the boxplot shown below.

Solution…
# 1) 

survey <- read_csv("survey.csv", show_col_types = FALSE)
# 1.a) 
survey |> 
  count(gender,exercise) |> 
  pivot_wider(names_from = exercise, values_from = n)
# 1.b)
survey |> 
  count(gender,exercise) |> 
  pivot_wider(names_from = gender, values_from = n)
# 1.c)
survey |> 
  count(exercise,smokes) |> 
  pivot_wider(names_from = exercise, values_from = n)

# 1.d)
# {gender, clap}, {gender, hand}, etc. 



# 2
survey |>  
  select(gender,fold,clap) |> 
  pivot_longer(!gender, names_to = "action", values_to = "side") |>  
  count(gender,side, name = "count") |> 
  pivot_wider(names_from = "side",values_from = "count")

# 3
ggplot(pulse %>% 
        drop_na() %>% 
        mutate(exercise=fct_relevel(exercise, c("low","moderate","high"))) %>%
        pivot_longer(c(pulse1,pulse2), names_to = "pulse",values_to = "level") ) + 
  aes(x=exercise, y=level, color=pulse) + geom_boxplot()

Real Data Analysis: Example data analysis problem (exercise) 📖

Files: rda-NL.rds, rda-UK.csv, rda-PL.tsv, rda-US.xlsx, rda-genes.csv.

Task concept

This task demonstrates issues arising during analysis of real-world data.

We have artificially generated several tables with data and we introduced various problems in them. Identify the problems and decide how to handle them (e.g. correct, ignore, …).
At the end of the course, the introduced problems will be discussed.

It is intentional, that the description is incomplete, and the task is not fully specified.
Use your knowledge, creativity and search skills to solve the task.

Example research question

Let’s assume that there are two simple research goals:

  • Study the distribution of BMI in different populations.
  • Find whether BMI is related to expression of some genes.

Several files with anthropometric data (known to be generated in different labs in different countries) are provided.
Moreover, there is a file with gene expression data for some of the subjects from all countries.

Deliverables:

  • A single, properly merged dataset containing all provided datasets (after cleaning and unification).
  • A report demonstrating the properties of the data, the distributions of BMI in different populations and the relations of BMI to expressions of each of the genes.

Click the links above (after Files:), to download the data files.

Organisation of files/directories

This task should be developed in a completely new RStudio project.

In the project directory:

  • There should be a subdirectory called data:
    • All input files with data should be placed in this directory.
    • The files should never be modified in any way.
  • It is convenient to have a subdirectory outs:
    • All automatically generated output files should be placed in this directory.
    • Separate directory allows easy removal of all generated files, in case of need.
  • Provide an R Markdown script producing the report and generating the combined dataset.
  • Provide a README.md file with a brief description of the project.

Sampling age distribution and age categories (exercise) 📖

Needs: sample.
Gives: cut, table, sample.
Practices: c, barplot, names, sample, hist, quantile, cut, table.

From the Eurostat database we extracted estimated numbers of people of each age (range 0-99) living in the Netherlands at some recent timepoint:

age2peopleNum <- c( 
    168269, 170352, 170475, 172564, 175965, 175404, 180139, 176908, 181784,
    185590, 190635, 191482, 192005, 188439, 191716, 193571, 199395, 206526,
    212170, 220073, 227052, 223141, 222346, 217582, 217364, 218670, 225921,
    226586, 229437, 232336, 233373, 225765, 223129, 222512, 221707, 216407,
    212812, 206718, 207274, 210873, 213552, 206261, 206100, 203072, 204502,
    205704, 214821, 221789, 239466, 250796, 261082, 266183, 254782, 251611,
    252467, 255486, 259306, 254400, 248828, 245545, 237918, 234688, 227404,
    221405, 216249, 209604, 205025, 200158, 198908, 190078, 189246, 189135,
    192126, 198903, 201625, 140595, 143352, 132521, 118605, 108982, 107050,
    98198, 90976, 79680, 73223, 65613, 58404, 50597, 45188, 38009, 32204,
    25186, 20334, 15452, 11867, 8684, 6348, 4353, 2924, 1837
)
age2peopleNum[[1]]  # 168269 people of age 0-1
age2peopleNum[[2]]  # 170352 people of age 1-2, etc.

Distribution

  • Produce a barplot of the age2peopleNum distribution of the number of people by age.
  • For better visualisation, add names to the vector age2peopleNum, so ages get visible in the barplot.

Sampling

  • Find how to use the prob argument of the sample function and generate a sample of 10000 ages from the distributed according to age2peopleNum. Store the sample in the variable as.
  • Use hist to show the histogram of the generated ages. The shape should be similar to the barplot.
  • What ages correspond to quantiles 0.1, 0.25, 0.5, 0.75, 0.9 of the sampled ages (hint: quantile)?

Building age categories

  • Find out how to use the cut function on sampled ages to create a factor with the following age categories: infant (age below 1), child (1<=age<10), teenager (10<=age<18), adult (18<=age<67), senior (67<=age). Store the factor in the variable fs.
  • Use table(fs, as) to count the number of people in each category and age: double check whether the assigned categories are correct (consider right=FALSE argument to cut).
Solution…
barplot( age2peopleNum )

ages <- 0:(length(age2peopleNum)-1)
names( age2peopleNum ) <- ages
barplot( age2peopleNum )

age2prob <- age2peopleNum / sum(age2peopleNum)
as <- sample( ages, size=10000L, replace=TRUE, prob=age2prob )
hist( as )
quantile( as, c(0.1, 0.25, 0.5, 0.75, 0.9) )

fs <- cut( as, breaks=c(0, 1, 10, 18, 67, 100), labels=c("infant", "child", "teenager", "adult", "senior"), right=FALSE )
table(fs, as)

Grades report, set operations (exercise) 📖

Needs: “Vectors”.
Gives: intersect, setdiff, union, “SetOperations”.
Practices: c, mean, names, “[…]”, intersect, setdiff, union.

Learn set operations

Study the following code to understand R set operations:

x <- 1:4
y <- 3:6

intersect(x,y)        # element in all arguments
union(x,y)            # element in any argument
setdiff(x,y)          # element in x but not in y
setdiff(y,x)          # element in y but not in x

Use set operations on named vectors

Alex has the following grades on several subjects:

|biology | 7.0|
|history | 6.0|
|math    | 9.0|
|music   | 7.5|
|physics | 8.0|

Hai, among other subjects, has taken similar subjects as Alex:

|biology   | 8.0|
|computing | 9.0|
|english   | 7.5|
|history   | 7.0|
|math      | 9.0|

Create the named numerical vectors gradesAlex and gradesHai from the tables above.

  • Print both vectors (so the subjects and grades are visible).
  • For each student what is the mean grade over subjects common to both?
  • What subjects has Alex taken that Hai has not and vice versa?
  • Show the distinct set of subjects taken by Alex and Hai.
Solution…
gradesAlex <- c(biology=7.0, history=6.0, math=9.0, music=7.5, physics=8.0)
gradesHai  <- c(biology=8.0, computing=9.0, english=7.5, history=7.0, math=9.0)

gradesAlex
gradesHai

# Manually it could be done as follows:
common_subjects <- c("biology", "history", "math")
# But we prefer use of 'intersect':

common_subjects <- intersect(names(gradesAlex),names(gradesHai))
mean(gradesAlex[common_subjects]) 
mean(gradesHai[common_subjects])

setdiff(names(gradesAlex),names(gradesHai))
setdiff(names(gradesHai),names(gradesAlex))
union(names(gradesAlex),names(gradesHai))

Data structures 2/2, Flow control

Lists (lecture) 👨‍🏫

Needs: “Vectors”, length, str, names.
Gives: “Lists”, list, “[[…]]”, names, length, str, “NULL”.

A list can be thought of as zero or more data elements (of any type), kept in order.
The elements can be accessed by their position (or index): first element, second element, …
Typically the elements are also named and consequently can be also accessed by the names.

A list allows to keep together related information.
For example, functions often return lists when their result consists of many different parts.

Create a list

Type the following code to manualy create a list and store it in a variable person:

person <- list(
  name = "Bob",                        # character vector, length 1
  age = 44,                            # numeric vector, length 1
  children = c( "Amy", "Dan", "Eve" )  # character vector, length 3
)
person
$name
[1] "Bob"

$age
[1] 44

$children
[1] "Amy" "Dan" "Eve"

List properties

The function str (structure) gives a compact display of data stored in a list:

str( person )
List of 3
 $ name    : chr "Bob"
 $ age     : num 44
 $ children: chr [1:3] "Amy" "Dan" "Eve"

Use length to find the number of elements in a list:

length( person )
[1] 3

To get the names of the elements of a list, use the names function:

names( person )
[1] "name"     "age"      "children"

Single element by position

An elements of a list can be accessed by its position (first, second, …) with help of [[...]] (double bracket) operator.
Try the following to get the first and the last elements:

person[[1]]
[1] "Bob"
person[[ length(person) ]]
[1] "Amy" "Dan" "Eve"

You may also set the element by position:

person[[1]] <- "Alice"

Observe, that the elements of the person list returned by [[...]] are of different classes:

str( person[[1]] )
 chr "Alice"
str( person[[2]] )
 num 44

Single element by name

Each of the names can be used with the [[...]] operator to get/set an element by name:

person[[ 'name' ]]
[1] "Alice"
person[[ 'name' ]] <- "Hans-Peter"
person[[ 'name' ]]
[1] "Hans-Peter"
person[[ 'children' ]]
[1] "Amy" "Dan" "Eve"

Subsetting a list, many elements

Use the single bracket [...] operator to subset a list.
Here, not an element is selected, but a new list is created with selected elements:

person[ "name" ]                          # a list with one element
$name
[1] "Hans-Peter"
selElems <- c( "name", "children" )
person[ selElems ]                        # a list with two elements
$name
[1] "Hans-Peter"

$children
[1] "Amy" "Dan" "Eve"

Adding and removing elements

Add a new element as follow:

person[[ 'smoker' ]] <- TRUE
person
$name
[1] "Hans-Peter"

$age
[1] 44

$children
[1] "Amy" "Dan" "Eve"

$smoker
[1] TRUE

Or, change the element:

person[[ 'smoker' ]] <- FALSE
person
$name
[1] "Hans-Peter"

$age
[1] 44

$children
[1] "Amy" "Dan" "Eve"

$smoker
[1] FALSE

To remove an element from the list, assign NULL (nothing) to it:

person[[ 'smoker' ]] <- NULL
person
$name
[1] "Hans-Peter"

$age
[1] 44

$children
[1] "Amy" "Dan" "Eve"

Note what happens when you try to access a not present element:

person[[ 'isMarried' ]]
NULL

Grades of students (exercise) 📖

Needs: “Lists”, “Vectors”, NA, length, min, mean, “[…]”, “[[…]]”, “NULL”.
Practices: length, “[…]”, “[[…]]”, min, mean, “NULL”, list.

There are four students: Alice, Bob, Charlie, and Daisy. Each student has several grades, and the number of grades is different for each student:

  • Alice: 8, 9, 7.5, 8, 9
  • Bob: 6, 7, 8, 7, 6, 5, 5, 8
  • Charlie: 9, 7, 9
  • Daisy: 8, ?, 6.5, 6.5, 7, 8, 7 (one grade is missing)

Create and inspect the list

Build a list grades with the grades of the students, then inspect it:

  • How many students are in the list?
  • What are the names of the students?
  • Retrieve Bob’s grades.
  • How many grades does Charlie have?
  • What are the mean grades of Bob and Daisy?
  • What is Alice’s lowest grade?
  • Alice and Daisy work in a group “A” and Bob and Charlie in a group “B”. Create separate lists for these groups in variables groupA and groupB.
Solution…
grades <- list(
    Alice = c( 8, 9, 7.5, 8, 9 ),
    Bob = c( 6, 7, 8, 7, 6, 5, 5, 8 ),
    Charlie = c( 9, 7, 9 ),
    Daisy = c( 8, NA, 6.5, 6.5, 7, 8, 7 )
)

length(grades)
names(grades)
grades[['Bob']]
length(grades[['Charlie']])
mean(grades[['Bob']], na.rm=TRUE)
mean(grades[['Daisy']], na.rm=TRUE)
min(grades[['Alice']], na.rm=TRUE)
groupA <- grades[c('Alice', 'Daisy')]
groupB <- grades[c('Bob', 'Charlie')]

Modify the list

  • Add Eve with grades: 7.7, 8.5, 9.
  • Add Charlie’s new grades: 8.5 and 9.
  • Bob has moved to another school. Remove Bob.
  • The missing grade of Daisy has been found. Replace the missing with the new grade 7.
Solution…
grades[['Eve']] <- c(7.7, 8.5, 9)
grades[['Charlie']] <- c(grades[['Charlie']], 8.5,9)
grades[['Bob']] <- NULL
grades[['Daisy']][2] <- 7

T-test returns a list (exercise) 📖

Needs: rnorm, mean, sd, “Lists”.
Gives: t.test.
Practices: rnorm, mean, sd, t.test, names.

The t-test is a statistical test used to determine if there is a significant difference between the means of two samples. (Here we will use the two-sample t-test, which is used to compare the means of two independent samples. There are some assumptions about the samples which must be met, but we do not discuss them here.)

Generate two samples from the same normal distribution

  • Generate two vectors x and y sampled from the same normal distribution (rnorm with the same mean and the same standard deviation). The vectors should have n elements each (n=30).
  • Check the mean and sd (standard deviation) of the generated samples.

Perform the t-test

  • Find out how to use t.test to perform the t-test on the x and y vectors. Store the result in a variable h.
  • The t.test function returns a list with the results of the test. Investigate the list (use str and/or names) and extract the difference estimate of the means and the p.value.
  • Interpret the p.value. Since the x and y samples originate from the same distribution, in most cases the t-test should not reject the null hypothesis that the means are equal. Does the observed p.value confirm this?

Repeat sampling and testing

  • Repeat manually all steps several times. Will the p.value be always above 0.05 or you observe a false positive?

Generate two samples from distributions with different means

  • Finally, generate the x and y vectors with clearly different means (but the same standard deviation).
  • Repeat the t-test. Is the result significant or you observe a false negative?
Solution…
n <- 30
x <- rnorm( n = n, mean = 0, sd = 1 )
y <- rnorm( n = n, mean = 0, sd = 1 )
mean(x)
sd(x)
mean(y)
sd(y)

h <- t.test( x, y )
names(h)
h$estimate
h$p.value

Matrices (lecture) 👩‍🏫

Needs: “Vectors”, length, str, names.
Gives: “Matrices”, matrix, “[.,.]”, nrow, ncol, dim, colnames, rownames, rowMeans, rowSums, colMeans, colSums, t, “%*%“.

A matrix is a two-dimensional, rectangular structure with all elements of the same type.
The elements can be accessed by their two-dimensional position: [row, column].
The rows and columns of the matrix might additionally be named and consequently can be also accessed by a pair [rowname, colname].

Create a matrix

A matrix can be manually created from a vector.
Depending on the arguments, elements of the vector are put into the matrix in a different order (column-by-column or row-by-row):

m <- matrix( 1:6, nrow = 2 )               # column-by-column
m
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
m <- matrix( 1:6, nrow = 2, byrow = TRUE ) # row-by-row
m
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6

Test the class of m:

class( m )
[1] "matrix" "array" 

Try the function str to get a compact summary of matrix content:

str( m )
 int [1:2, 1:3] 1 4 2 5 3 6

Dimensions

There are several functions to get dimensions of a matrix.

Number of columns is provided by:

ncol( m )
[1] 3

Number of rows is provided by:

nrow( m )
[1] 2

Dimensions (two elements: number of rows and number of columns):

dim( m )
[1] 2 3

Names of the columns and rows

colnames( m ) and rownames( m ) are used to set and get the vectors with columns and rows names.

Try to set the names:

m <- matrix( 1:6, nrow = 3 )
m
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
colnames( m ) <- c( "A", "B" )
m
     A B
[1,] 1 4
[2,] 2 5
[3,] 3 6
rownames( m ) <- c( "X", "Y", "Z" )
m
  A B
X 1 4
Y 2 5
Z 3 6

To get the names use:

[1] "X" "Y" "Z"
[1] "A" "B"

Accessing matrix elements

Use single brackets notation with two arguments [ row(s), col(s) ] to get/set elements.
Note, when getting only a single row or a single column the result is returned as a vector, not as a matrix.

m[ 3, 1 ]             # a single element as a vector of length 1
[1] 3
m[ 3, ]               # a single row as a vector
A B 
3 6 
m[ , 1 ]              # a single column as a vector
X Y Z 
1 2 3 
m[ c( 2, 3 ), 1 ]     # two rows/one column, still a vector
Y Z 
2 3 
m[ c( 2, 3 ), c( "B", "A" ) ]     # a matrix 2x2
  B A
Y 5 2
Z 6 3
m[ c( F, T, T ), c( "B", "A" ) ]  # a matrix 2x2 (rows 2 and 3 corresponding to TRUE)
  B A
Y 5 2
Z 6 3

Notice the difference in the output class when only a single element is requested:

class( m[ 3, 1 ] )
[1] "integer"
class( m[ c( 2, 3 ), c( "B", "A" ) ] )
[1] "matrix" "array" 

With the drop argument you may enforce that always a matrix is returned (never a vector):

m[ 3, 1 ]                 # a vector
[1] 3
m[ 3, 1, drop = FALSE ]   # a matrix
  A
Z 3
class( m[ 3, 1, drop = FALSE ] )
[1] "matrix" "array" 

Empty row/column argument field means all rows and/or all columns:

m[ , c( "B", "A" ) ]       # all rows, selected columns
  B A
X 4 1
Y 5 2
Z 6 3
m[ c( "Z", "X", "Y" ), ]   # selected rows, all columns
  A B
Z 3 6
X 1 4
Y 2 5
m[ , ]                     # all rows and columns, same as just "m"
  A B
X 1 4
Y 2 5
Z 3 6

Useful matrix functions

Short summary of matrix operations: http://www.statmethods.net/advstats/matrix.html

Transposition

m <- matrix( 1:6, nrow = 3 )
colnames( m ) <- c( "A", "B" )
rownames( m ) <- c( "X", "Y", "Z" )
m
  A B
X 1 4
Y 2 5
Z 3 6
t(m)
  X Y Z
A 1 2 3
B 4 5 6

Matrix multiplication

m
  A B
X 1 4
Y 2 5
Z 3 6
t( m )
  X Y Z
A 1 2 3
B 4 5 6
m %*% t( m )
   X  Y  Z
X 17 22 27
Y 22 29 36
Z 27 36 45

Element-wise multiplication

m
  A B
X 1 4
Y 2 5
Z 3 6
m+1
  A B
X 2 5
Y 3 6
Z 4 7
m * (m+1)
   A  B
X  2 20
Y  6 30
Z 12 42

Row/columns means/sums

m
  A B
X 1 4
Y 2 5
Z 3 6
  X   Y   Z 
2.5 3.5 4.5 
rowSums( m )
X Y Z 
5 7 9 
A B 
2 5 
colSums( m )
 A  B 
 6 15 

Correlation heatmaps (exercise) 📖

Needs: “Matrices”.
Gives: cor, cbind, heatmap.
Practices: rnorm, cor, cbind, heatmap.

Correlated vectors

Let’s define a parameter size <- 12. Later this will be the number of rows of the matrix.
From the normal distribution let’s sample a random vector: x <- rnorm( size ).
Now, let’s create another vector x1 by adding some noise to x: x1 <- x + rnorm( size )/10.
The noise is “small”, so the correlation coefficient of x and x1 should be close to 1.0: check this with function cor.
Finally, in a similar way create vectors x2 and x3 by adding (other) small noise to x.

Solution…
size <- 12
x <- rnorm( size )
x1 <- x + rnorm( size )/10
cor( x, x1 )
x2 <- x + rnorm( size )/10
x3 <- x + rnorm( size )/10

Matrix from vectors; matrix heatmap

Let’s merge x1, x2 and x3 column-wise into a matrix using m <- cbind( x1, x2, x3 ).
Check the class of m.
Show the first few rows of m.
Try a simple heatmap visualisation of the matrix: heatmap( m, Colv = NA, Rowv = NA, scale = "none" ).
How are the vectors shown?
Which colors correspond to the lowest/highest matrix values?
Do the vectors appear correlated?

Solution…
m <- cbind( x1, x2, x3 )
class( m )
head( m )
heatmap( m, Colv = NA, Rowv = NA, scale = "none" ) # high is dark red, low is yellow
                # x1, x2, x3 follow similar color pattern, they should be correlated

Matrix of correlated and uncorrelated vectors

Repeat the first exercise and create several other vectors y1y4 correlated to each other, but not correlated with x.
Each vector should have the same length size.
Now build again m from the columns x1x3,y1y4 (in some random order).
Show again the heatmap; you should see similarity between some columns.

Solution…
y <- rnorm( size )
y1 <- y + rnorm( size )/10
y2 <- y + rnorm( size )/10
y3 <- y + rnorm( size )/10
y4 <- y + rnorm( size )/10
m <- cbind( y4, y3, x2, y1, x1, x3, y2 )
heatmap( m, Colv = NA, Rowv = NA, scale = "none" ) # high is dark red, low is yellow

Matrix of correlations

Use cc <- cor( m ) to build the matrix of correlation coefficients between the columns of m.
Use round( cc, 3 ) to show this matrix with 3 digits precision.
Can you interpret the values?

Solution…
cc <- cor( m )
round( cc, 3 )

Here is an example, how the heatmap of correlations could look like (of course, for your randomly sampled data it will look slighly different):

Solution…
size <- 12

# --- x and y have correlation around 0 ---
x <- rnorm( size )
y <- rnorm( size )

# --- x1, x2, x3 are correlated to each other ---
x1 <- x + rnorm( size )/10
x2 <- x + rnorm( size )/10
x3 <- x + rnorm( size )/10

# --- y1...y4 are correlated to each other ---
y1 <- y + rnorm( size )/10
y2 <- y + rnorm( size )/10
y3 <- y + rnorm( size )/10
y4 <- y + rnorm( size )/10

# --- matrix of all vectors ---
m <- cbind( y4, y3, x2, y1, x1, x3, y2 )

# --- calculating correlations of all matrix column pairs ---
cc <- cor( m )

# --- visualisation of correlations ---
heatmap( cc, symm = TRUE, scale = "none" )

Solution…
# E.g. value for (row: x1, col: y1) is the corerlation of vectors x1, y1.
# Values of 1.0 are on the diagonal: e.g. x1 is perfectly correlated with x1.
# Correlations between x, x vectors are close to 1.0.
# Correlations between y, y vectors are close to 1.0.
# Correlations between x, y vectors are close to 0.0.

Modelling weight from height, linear model (exercise) 📖

Needs: “pulseNA.csv”.
Gives: lm, summary, “Formulas”, plot, abline, “~”.
Practices: lm, summary, plot, abline, names, “~”.

Formula objects

Let’s assume that it is acceptable to use a linear model to model the weight of a person based on their height. R uses the formula notation to express the relationship between variables: weight ~ height. The formula is an abstract representation of the relationship between the variables. No calculations is performed at this stage.

height ~ weight           # height depends on weight
height ~ weight
height ~ weight + gender  # height depends on weight and gender
height ~ weight + gender
form <- weight ~ height   # formulas are objects; can be stored in variables
form
weight ~ height

Fitting a linear model: weight ~ height

  • Load the pulseNA.csv dataset.
  • Plot the weight against the height using the plot function.
  • Fit out how to fit a linear model to the data with the lm function. Use the formula weight ~ height. Store the result in a variable h. Observe what is printed when you print the h object.
  • Use the summary function to get a further summary of the linear model. Store the result in a variable sumH. Print it.
  • Plot again, use plot( weight ~ height, data = d ) + abline(h, col="red").
  • Treat sumH as a list and find out what are the names of the elements of the sumH object. Print the coefficients and adj.r.squared elements. Compare their values with the printed output of h and sumH.
  • Can you interpret the result?
Solution…
d <- read.csv( "pulseNA.csv" )
form <- weight ~ height
plot( form, data = d ) # formula notation is also allowed here
h <- lm( form, data = d )
plot( weight ~ height, data = d ) + abline(h, col="red")
h
sumH <- summary(h)
sumH
names(sumH)
sumH$coefficients
sumH$adj.r.squared

Flow control: if/else, for (lecture) 👩‍🏫

Needs: “Vectors”, sample, intersect, names, sort.
Gives: “if”, “else”, “for”, “break”, “next”, seq_along, seq_len.

Conditional execution: if, else

The if statement allows to execute a block of code only if a condition is met. After if there can be an optional else statement with another block of code to execute if the if condition has not been met.

Let’s consider a simple example of a lottery bet:

myBet <- c( 1, 24, 26, 32, 36, 40 )               # 6 numbers from 1 to 45
lotteryDraw <- sort( sample( 1:45, 6 ) )          # lottery draw
lotteryDraw
[1]  1 10 18 25 36 37
intersect( myBet, lotteryDraw )                   # matched numbers
[1]  1 36
num <- length( intersect( myBet, lotteryDraw ) )  # number of matched numbers
num
[1] 2

The following code prints some text only if at least one number matched:

if( num > 0 ) {
  print( "You matched at least one!" )
}
[1] "You matched at least one!"

Additionally, a different text is printed if no number matched:

if( num == 0 ) {
  print( "You lost!" )
} else {
  print( "You matched at least one!" )
}
[1] "You matched at least one!"

The following code prints different text depending on the number of matched numbers:

if( num >= 3 ) {
  print( "WOW, you matched at least three!" )
} else if( num == 0 ) {
  print( "You lost!" )
} else {
  print( "There is a match, but of low value." )
}
[1] "There is a match, but of low value."

Looping: for

A for loop repeats a block of code for each element of an iterable object (e.g. vector or list), in the same order.

A typical use-case is to loop over elements of a vector (or a list):

vs <- c( a=37, b=44, c=21, d=63, e=15 )

sum <- 0
for( v in vs ) {            # loop over elements of vs; works always
  sum <- sum + v            # exectued many times for v being each element of vs
}
sum
[1] 180

Another way is to loop over indices of a vector (or a list). Use seq_along to get a sequence of indices (it works correctly for the empty vector/list):

sum <- 0
for( i in seq_along(vs) ) { # loop over indices of vs; works always
  sum <- sum + vs[[i]]      # executed many times for i being each index of vs
}
sum
[1] 180

For objects with named elements, it is possible to loop over names:

sum <- 0
for( n in names(vs) ) {     # loop over names of vs elements; fails if no names
  sum <- sum + vs[[n]]      # executed many times for n being each name of vs
}
sum
[1] 180

To simply repeat a block of code a fixed number of times, use seq_len:

repeatNum <- 4              # seq_len(4) is c(1,2,3,4), seq_len(0) is empty
for( i in seq_len(repeatNum) ) {
  print( i )                # 1:4 id c(1,2,3,4), but 1:0 is NOT empty!
}
[1] 1
[1] 2
[1] 3
[1] 4

A loop might be broken with the break statement:

vs <- c( 18, 33, NA, 61 )
anyMissing <- FALSE
for ( v in vs ) {
  if( is.na(v) ) {
    anyMissing <- TRUE
    print( "Missing found" )
    break
  }
  print( v )
}
[1] 18
[1] 33
[1] "Missing found"
anyMissing
[1] TRUE

To skip the rest of the loop and continue with the next iteration, use the next statement:

vs <- c( 18, 33, -1, 61, -1 )
for ( v in vs ) {
  if( v < 0 ) {
    print( "Not printing negatives" )
    next
  }
  print( v )
}
[1] 18
[1] 33
[1] "Not printing negatives"
[1] 61
[1] "Not printing negatives"

Flow control: functions (lecture) 👩‍🏫

Needs: “if”, “else”, sum, exp, 🔴log, length.
Gives: “function”, return, stopifnot, all.equal.

Function definition

Functions allow to encapsulate a block of code and execute (call) it multiple times. The name of the function allows to represent the block of code with a single word. The arguments of the function allow to pass different values to the block of code. The return value of the function allows to pass a single value back to the caller.

ageGroup <- function(age) { # define function `ageGroup`, one argument `age`
  if( age < 18 ) {          # function body...
    return( "child" )       #     ...return text "child" if age is less than 18
  } else if( age < 65 ) {
    return( "adult" )       #     ...return text "adult" if age is less than 65
  } else {
    return( "senior" )      #     ...return text "senior" otherwise
  }                         # ...end of function body
}

ageGroup( 12 )              # call function `ageGroup` with first argument 12
[1] "child"
ageGroup( age=35 )          # call function `ageGroup` with `age` argument 35
[1] "adult"

Testing functions

When we develop a function, we always first test it. The stopifnot function allows to replace manual testing and save the tests for future (when potentially the function gets incorectly modified). The stopifnot function produces an error if the expression in its argument is not TRUE:

stopifnot( ageGroup( 12 ) == "child" )
stopifnot( ageGroup( age=35 ) == "adult" )
# stopifnot( ageGroup( age=35 ) == "ADULT" )   # test to see the error message

Default arguments

A function might have default arguments. The caller can omit the value of a default argument, and the default value will be used instead. Let’s define our own function to calculate the mean of a vector of numbers. The function will have an argument isGeometric to allow to calculate the geometric mean instead of the algebraic mean:

myMean <- function( vs, isGeometric=FALSE ) {
  if( isGeometric ) {
    res <- exp( sum(log(vs)) / length(vs) )
  } else {
    res <- sum(vs)/length(vs)
  }
  res
}

myMean( vs=c(2,4,8), isGeometric=FALSE )      # algebraic mean
[1] 4.666667
myMean( vs=c(2,4,8), isGeometric=TRUE )       # geometric mean
[1] 4
myMean( vs=c(2,4,8) )                         # algebraic mean, default arg
[1] 4.666667
myMean( c(2,4,8), TRUE )                      # arguments matched by position
[1] 4
myMean( isGeometric=TRUE, vs=c(2,4,8) )       # arguments matched by name
[1] 4

When writing tests for equality of numerical results, it is better to use all.equal (close numerical result up to a certain tolerance) instead of == (exact equality):

stopifnot( myMean( c(1,2,3,4,5) ) == 3 )        # works well
#stopifnot( myMean( c(1,10,100), TRUE ) == 10 ) # fails, precision problem
stopifnot( all.equal( myMean( c(1,10,100), TRUE ), 10 ) ) # works well

Printing patterns (exercise) 📖

Needs: “for”, “if”, “else”, seq_len.
Gives: cat, stop.
Practices: “for”, “if”, “else”, cat, stop.

Understand, how the following code draws a rectangle from * symbols:

rowsNum <- 4
colsNum <- 15
fillChar <- "*"
for( r in seq_len(rowsNum) ) {
  for( c in seq_len(colsNum) ) {
    cat( fillChar )                 # prints a single character
  }
  cat( "\n" )                       # prints a newline character
}
***************
***************
***************
***************

Encapsulate the code in a function

Define a function catFilledRect. The function should print a rectangle as shown above. The size of the rectangle should be provided as function arguments rowsNum and colsNum. The fill character should be provided as the third argument fillChar (with a default value of *).

Below you can see some examples of the expected output.

Solution…
catFilledRect <- function( rowsNum, colsNum, fillChar="*" ) {
  for( r in seq_len(rowsNum) ) {
    for( c in seq_len(colsNum) ) {
      cat( fillChar )
    }
    cat( "\n" )
  }
}
catFilledRect( rowsNum=3, colsNum=15, fillChar="X" )
XXXXXXXXXXXXXXX
XXXXXXXXXXXXXXX
XXXXXXXXXXXXXXX
catFilledRect( 4, 2 )
**
**
**
**

Reuse one function in another

Write a function catSquare that prints a square of a given size. The function should use the catFilledRect function with appropriate arguments.

The function should work as follows:

Solution…
catSquare <- function( size, fillChar="*" ) {
  catFilledRect( rowsNum=size, colsNum=size, fillChar=fillChar )
}
catSquare( size=4, fillChar="X" )    # prints 4x4 square filled with X
XXXX
XXXX
XXXX
XXXX
catSquare( size=2 )                  # prints 2x2 square filled with *
**
**

Add an error stop for nonsense arguments

Find out how to use stop to produce an error when a negative size is provided. Rewrite catSquare to produce the error as follows:

Solution…
catSquare <- function( size, fillChar="*" ) {
  if( size < 0 ) {
    stop( "Size must be non-negative!" )
  }
  catFilledRect( rowsNum=size, colsNum=size, fillChar=fillChar )
}
catSquare( size=-1, fillChar="X" )     # we print an own error for size < 0
Error in catSquare(size = -1, fillChar = "X"): Size must be non-negative!

Different shapes

Implement functions catRect, catBackslash, catSlash, and catX based on the examples shown below. The functions should print the shapes as shown. You may throw errors when arguments make no sense (not shown in the solutions).

Solution…
catRect <- function( rowsNum, colsNum, drawChar="*", fillChar="." ) {
  for( r in seq_len(rowsNum) ) {
    for( c in seq_len(colsNum) ) {
      if( r == 1 || r == rowsNum || c == 1 || c == colsNum ) {
        cat( drawChar )
      } else {
        cat( fillChar )
      }
    }
    cat( "\n" )
  }
}
catRect( rowsNum=5, colsNum=15, drawChar="*", fillChar="." )
***************
*.............*
*.............*
*.............*
***************
Solution…
catBackslash <- function( size, drawChar="*", fillChar="." ) {
  for( r in seq_len(size) ) {
    for( c in seq_len(size) ) {
      if( r == c ) {
        cat( drawChar )
      } else {
        cat( fillChar )
      }
    }
    cat( "\n" )
  }
}
catBackslash( size=6, drawChar="*", fillChar="." )
*.....
.*....
..*...
...*..
....*.
.....*
Solution…
catSlash <- function( size, drawChar="*", fillChar="." ) {
  for( r in seq_len(size) ) {
    for( c in seq_len(size) ) {
      if( r == size-c+1 ) {
        cat( drawChar )
      } else {
        cat( fillChar )
      }
    }
    cat( "\n" )
  }
}
catSlash( size=8, drawChar="*", fillChar="." )
.......*
......*.
.....*..
....*...
...*....
..*.....
.*......
*.......
Solution…
catX <- function( size, drawChar="*", fillChar="." ) {
  for( r in seq_len(size) ) {
    for( c in seq_len(size) ) {
      if( r == size-c+1 || r ==c ) {
        cat( drawChar )
      } else {
        cat( fillChar )
      }
    }
    cat( "\n" )
  }
}
catX( size=8, drawChar="*", fillChar="." )
*......*
.*....*.
..*..*..
...**...
...**...
..*..*..
.*....*.
*......*

Graphics

Introduction to ggplot2 (lecture) 👨‍🏫

Files: pulseNA.csv.
Needs: “tidyverse”, read_csv, “pulseNA.csv”, “|>”.
Gives: “ggplot2”, ggplot, geom_point, aes, drop_na.

The ggplot2 package (a part of the tidyverse library) is a powerful tool for creating visualizations in R. The package provides the grammar of graphics and allows to create complex plots with a simple syntax.
The main input is a table (a data.frame or a tibble) with the data to be plotted, and the main function is ggplot. A plot is built by adding layers to the plot object.

library(tidyverse) # or: library(ggplot2)
d <- read_csv( "pulseNA.csv" , show_col_types = FALSE )

plotData <- d |> drop_na( pulse1, pulse2, gender )
ggplot(plotData) + 
  geom_point( aes(x=pulse2, y=pulse1) )

The layer provided by geom_point is used to create a scatter plot. Points in a scatter plot have coordinates given by the x and y aesthetics. Moreover, the points have color, shape, and size aesthetics. When an aesthetic is listed in the aes function, it refers to a column of the table. When it is listed outside of the aes function, it is a constant value.

Below, the x and y aesthetics are taken from the pulse1 and pulse2 columns of the table, whereas the color aesthetic is set to a constant darkgreen (see help for the function colors for other colors).

ggplot(plotData) + 
  geom_point( aes(x=pulse1, y=pulse2), color="darkgreen" )

Below, the color aesthetic uses the ran column of the table. Some colors are automatically chosen to represent the levels of the ran column:

ggplot(plotData) + 
  geom_point( aes(x=pulse1, y=pulse2, color=ran) )

The size, shape, alpha (transparency level in 0..1 range) are other available aesthetics (see Aesthetic specifications).

ggplot(plotData) + 
  geom_point( aes(x=pulse1, y=pulse2, color=ran, shape=gender), size=4, alpha=0.5 )

Customizing plot axes, colors, scales and dimensions (exercise) 📖

Files: pulseNA.csv.
Needs: “tidyverse”, “pulseNA.csv”, ggplot, geom_point, aes, “|>”, drop_na.
Gives: scale_color_manual, scale_shape_manual, xlab, ylab, ggsave, scale_color_continuous, scale_size_area, scale_x_continuous, scale_y_continuous, coord_fixed, geom_abline.
Practices: ggplot, aes, scale_color_manual, scale_shape_manual, xlab, ylab, ggsave, scale_color_continuous, scale_size_area, scale_x_continuous, scale_y_continuous, coord_fixed, geom_abline.

Categorical variables used for color and shape; axes labels

Start with a plot defined in the variable p as follows:

suppressPackageStartupMessages( library(tidyverse) )

plotData <- read_csv( "pulseNA.csv" , show_col_types = FALSE ) |>
  drop_na( pulse1, pulse2, gender )
p <- ggplot(plotData) + 
  geom_point( aes(x=pulse1, y=pulse2, color=ran, shape=gender), size=4, alpha=0.5 )
p

Find, what needs to be further added to p to obtain the plot shown below:

  • The functions scale_color_manual and scale_shape_manual allow to customize the colors and shapes of the points used in the plot, when data is categorical. Their values argument is a named vector allowing to manualy enumerate shapes or colors (e.g. c("TRUE"="red", ...)). The name argument allows to specify the legend title (use \n to insert a line break).
  • The functions xlab and ylab allow to set the labels of the x and y axes.
Solution…
p <- p +
  scale_color_manual( values=c("TRUE"="red", "FALSE"="blue"), name="Did student\nrun?" ) +
  scale_shape_manual( values=c("female"=19, "male"=15), name="Gender of\nthe student" ) +
  xlab("Pulse before exercise [1/min]") +
  ylab("Pulse after exercise [1/min]")
p

Continuous variables used for color, size, axes

Now, start with a different plot:

p <- ggplot(plotData) + 
  geom_point( aes(x=pulse1, y=pulse2, color=weight, size=weight), alpha=0.5 )
p

Find, what needs to be further added to p to obtain the plot shown below:

  • The function scale_color_continuous allow to customize the colors representing a continuous variable. The low and high arguments specify the colors for the smallest and the largest values of the variable.
  • The function scale_size_area allows to customize the sizes of the points representing a non-negative continuous variable. The max_size argument specifies the size of the largest point.
  • The functions scale_x_continuous and scale_y_continuous allow to specify more details of x and y axes. The name argument specifies the title of the axis. The breaks argument specifies the positions of the ticks on the axis.
  • The function coord_fixed enforces the same physical size of a unit on both axes. This is useful, when the units of the x and y axes are the same.
  • The geom_abline(slope=1) function adds a diagonal line to the plot. Other arguments allow to specify the color and type of the line.
Solution…
brks <- seq(40,180,10)
p <- p +
  geom_abline( slope=1, intercept=0, color="black", linetype="dashed" ) +
  scale_color_continuous( low="blue", high="red", name="Weight [kg]" ) +
  scale_size_area( name="Weight [kg]", max_size=5 ) +
  scale_x_continuous(name="Pulse before exercise [1/min]", breaks=brks) +
  scale_y_continuous(name="Pulse after exercise [1/min]", breaks=brks) +
  coord_fixed()
p

Saving the plot to a file, changing dimensions

Given the plot in the variable p, find out how to use ggsave to save the plot to a file in a png format. Try to save the plot in two different image sizes.

Solution…
ggsave( filename="plot1.png", plot=p, width=6, height=4 ) # png format, small
ggsave( "plot2.png", pp, width=8, height=6 )              # png, larger
ggsave( filename="plot3.eps", plot=p, width=6, height=4 ) # different format

Moreover, find out what parameters of an R chunk need to be set in order to change the plot size when the plot is knitted. For example like here:

Solution…
# {r fig.width=5,fig.height=4} are the parameters of the R chunk used here
p

Plot types: histogram, boxplot, barplot (exercise) 📖

Files: pulseNA.csv.
Needs: “tidyverse”, “pulseNA.csv”, ggplot, geom_point, aes, factor, ylab.
Gives: geom_histogram, geom_boxplot, geom_bar, ggtitle, scale_x_continuous, geom_jitter, theme_dark, ylab, theme_bw.
Practices: ggplot, aes, geom_histogram, geom_boxplot, geom_bar, ggtitle, scale_x_continuous, geom_jitter, theme_dark, theme_bw, drop_na, ylab, factor.

Loading the data

Load the dataset as follows. Then find out how to recreate the following plots. Pay attention to the details. The solutions should include the functions/arguments/words listed at each point.

suppressPackageStartupMessages( library(tidyverse) )
plotData <- read_csv( "pulseNA.csv" , show_col_types = FALSE)

Histogram

In your solution use (among others): drop_na, geom_histogram, binwidth, boundary, fill, color, ggtitle, scale_x_continuous, breaks, seq, name. There should be no warning message.

Solution…
p <- ggplot(
    plotData |> drop_na( pulse1 )
  ) + 
  geom_histogram( 
    aes(x=pulse1), 
    binwidth=20, boundary=80, fill="white", color="black" 
  ) +
  ggtitle( "Histogram of pulse before exercise" ) +
  scale_x_continuous( breaks=seq(40,200,10), name="Pulse [1/min]" )
p

Boxplot

Use: geom_boxplot, geom_jitter, outlier.color, theme_dark, ylab.

Solution…
p <- ggplot(plotData) + 
  aes( x = gender, y = weight ) +
  geom_jitter( width=0.3, color = "orange" ) +
  geom_boxplot( fill=NA, color="white", outlier.color=NA ) +
  theme_dark() +
  ylab( "Weight [kg]" )
p

Barplot

Use: mutate, factor, levels, geom_bar, fill, color, alpha, theme_bw.

Solution…
p <- ggplot(
    plotData |> 
      mutate( exercise = factor( exercise, levels=c("low","moderate","high") ) )
  ) +
  geom_bar( aes(x=exercise), fill="gray", color="black", alpha=0.5 ) + 
  theme_bw()
p

Plots with panels (faceting) (exercise) 📖

Needs: ggplot, aes, geom_point, 🔴facrtor, drop_na, mutate.
Gives: facet_grid, facet_wrap.
Practices: facet_grid, facet_wrap, drop_na, mutate.

Start with the following code that reads the data into plotData and produces a scatter plot of the pulse rates in p:

suppressPackageStartupMessages( library(tidyverse) )
plotData <- read_csv( "pulseNA.csv" , show_col_types = FALSE) |>
  drop_na( exercise, gender, pulse1, pulse2 )

p <- ggplot(plotData) + 
  geom_point( aes(x=pulse1, y=pulse2, color=ran, shape=gender), size=4, alpha=0.5 )

Discover how the facet_grid and facet_wrap functions can be used to create a plot with multiple panels:

p + facet_grid( . ~ exercise )
p + facet_grid( gender ~ . )
p + facet_grid( gender ~ exercise )
p + facet_wrap( ~ year )

Note, that the levels of exercise are not shown in the correct order. To correct that, change the plotData table: convert exercise to a factor with levels set to c("low","moderate","high"). Then, recreate the plots with exercise variable.

Solution…
p + facet_grid( . ~ exercise )       # exercise horizontally
p + facet_grid( gender ~ . )         # gender vertically
p + facet_grid( gender ~ exercise )  # gender vertically, exercise horizontally
p + facet_wrap( ~ year )             # year in multiple rows (like writing text)

plotData <- read_csv( "pulseNA.csv" , show_col_types = FALSE) |>
  drop_na( exercise, gender ) |>
  mutate( exercise = factor( exercise, levels=c("low","moderate","high") ) )
p <- ggplot(plotData) + 
  geom_point( aes(x=pulse1, y=pulse2, color=ran, shape=gender), size=4, alpha=0.5 )
p + facet_grid( gender ~ exercise )

Retrieving JSON weather list (exercise) 📖

Needs: “Lists”.
Gives: as.POSIXct, “epoch”, plot, “fromJSON”, “rjson”.
Practices: str, plot, “fromJSON”, as.POSIXct.

Weather data source and format

Let’s analyse the weather data from publicly available Open-meteo archive. The data can be obtained in the JSON file format (in order to see the structure of a JSON file, copy/paste to a browser tab the link provided below in the url variable).

Retrieving the data

Install the rjson package when necessary. Then, use the code below to download the weather data for Leiden:

#install.packages("rjson")
library(rjson)
url <- "https://archive-api.open-meteo.com/v1/archive?latitude=52.1583&longitude=4.4931&start_date=2024-07-01&end_date=2024-07-31&hourly=temperature_2m,relative_humidity_2m&timeformat=unixtime&timezone=Europe%2FBerlin"
ls <- rjson::fromJSON( file = url )

Study the structure of the list

  • Investigate the url variable. Note that in a certain notation the geographical latitude and longitude of Leiden are provided. Moreover, the period of time is set from 2024-07-01 to 2024-07-31. Finally, the type of data to be downloaded is specified: hourly temperature and relative humidity at 2m above the ground. The timezone is set to Europe/Berlin. Look at Historical Weather API if you want to find more details.
  • Investigate and understand the structure of the ls object.
  • Find a way to extract the numerical vector of temperatures.
  • Find the temperature unit in another element of the ls list.
  • Note, that the time is provided in the Unix epoch format (numbers like 1719784800... are seconds since 1970-01-01 00:00:00 UTC). Extract the vector of times and find a way to convert the time to the human-readable format (as.POSIXct).
  • Finally, plot the temperature against time using the basic plot function (find out how to draw lines, not points).
Solution…
str(ls)
ls$hourly$temperature_2m
ls$hourly_units$temperature_2m
as.POSIXct(ls$hourly$time)
plot(as.POSIXct(ls$hourly$time), ls$hourly$temperature_2m, type="l")

A function to retrieve weather data (exercise) 📖

Needs: “function”, “fromJSON”, as.POSIXct, tibble, stopifnot.
Gives: paste0, is.data.frame, is.numeric.
Practices: “function”, “fromJSON”, as.POSIXct, tibble, paste0, stopifnot, is.data.frame, is.numeric.

Prerequisite

First, solve the exercise “Retrieving JSON weather list”.

Build the function

Build a function getWeatherData which downloads the weather data for a given location and time period. The function should take the following arguments: long (longitude), lat (latitude), startDate, and endDate. The function should return a tibble with columns: time, temperature, and humidity.

Hint: To build url use the paste0 function to merge fixed parts with the parameters provided in long, lat, startDate and endDate.

Solution…
getWeatherData <- function( long, lat, startDate, endDate ) {
  url <- paste0( 
    "https://archive-api.open-meteo.com/v1/archive?",
        "latitude=", lat, 
        "&longitude=", long, 
        "&start_date=", startDate, 
        "&end_date=", endDate, 
        "&hourly=temperature_2m,relative_humidity_2m&timeformat=unixtime&timezone=Europe%2FBerlin"
  )
  #message( "Accessing URL: ", url )
  ls <- rjson::fromJSON( file = url )
  tibble( 
    time = as.POSIXct(ls$hourly$time), 
    temperature = ls$hourly$temperature_2m, 
    humidity = ls$hourly$relative_humidity_2m 
  )
}

Here is an example of the function usage and a fragment of the expected result:

d <- getWeatherData( long=4.4931, lat=52.1583, startDate="2024-07-01", endDate="2024-07-31" )
head(d)
# A tibble: 6 × 3
  time                temperature humidity
  <dttm>                    <dbl>    <dbl>
1 2024-07-01 01:00:00        13.4       82
2 2024-07-01 02:00:00        12.8       83
3 2024-07-01 03:00:00        13.2       83
4 2024-07-01 04:00:00        13.1       84
5 2024-07-01 05:00:00        13.2       83
6 2024-07-01 06:00:00        13.2       84

Implement some tests

You see by eye that the result in d is somehow fine. However, in the future you may be improving the function and accidentally you may destroy the output. To prevent this, implement some tests to check whether the function works correctly.

Use the stopifnot function to check whether:

  • The output is a data frame (is.data.frame; a tibble is a data frame).
  • The output has the columns time, temperature, and humidity.
  • The columns temperature and humidity are numeric (is.numeric).
Solution…
d <- getWeatherData( long=4.4931, lat=52.1583, startDate="2024-07-01", endDate="2024-07-31" )
stopifnot( is.data.frame(d) )
stopifnot( all( c("time", "temperature", "humidity") %in% names(d) ) )
stopifnot( is.numeric(d$temperature) )
stopifnot( is.numeric(d$humidity) )

Plot weather data for several cities (exercise) 📖

Needs: ggplot, aes, facet_grid, theme_bw, scale_color_continuous, xlab, ylab, 🔴lapply, “function”.
Gives: geom_line, bind_rows.
Practices: “function”, 🔴lapply, bind_rows, ggplot, aes, geom_line, facet_grid, theme_bw, scale_color_continuous, xlab, ylab.

Prerequisite

First, solve the exercise “A function to retrieve weather data”. Extract the function getWeatherData from the solution and use it in the code below.

Goal

The goal is to produce the function plotWeather, which produces a plot of the temperature and humidity for several cities as shown below:

  • The function should take an argument city2coords: a list providing names and geographical coordinates of the cities (see the example).
  • The startDate and endDate arguments should specify the time period for which the data should be retrieved.
  • The function should return the plot. Use geom_line. The order of the cities in the plot should be the same as in the city2coords list.
  • The getWeatherData function should be used to retrieve the data for each city.
  • Use the lapply function to iterate over the cities and retrieve the weather data as a list of tibbles. Put this list as the argument to the bind_rows function to merge the tibbles into one large tibble for all cities. Use the .id argument of bind_rows to add a column with the city names.

The plot

The following code is expected to work and reproduce the shown plot. Once your implementation is ready, try it with two additional cities of your choice.

Solution…
plotWeather <- function( city2coords, startDate, endDate ) {
  city2weatherData <- lapply( city2coords, function( coords ) {
    getWeatherData( 
      long=coords$long, lat=coords$lat, 
      startDate=startDate, endDate=endDate 
    )
  } )
  
  d <- bind_rows( city2weatherData, .id="city" )
  ggplot( d, aes(x=time, y=temperature, color=humidity ) ) + 
    geom_line() + 
    facet_grid( city ~ . ) +
    theme_bw() +
    scale_color_continuous( 
      low="yellow", high="blue", name="Humidity\n[rel%]" 
    ) +
    xlab( "Date/time" ) +
    ylab( "Temperature [°C]" )
}
city2coords <- list(
  Leiden = list( long=4.4931, lat=52.1583 ),
  Amsterdam = list( long=4.8897, lat=52.374 ),
  "Kraków" = list( long=19.945, lat=50.0647 )
)

plotWeather( city2coords, startDate="2024-07-01", endDate="2024-08-01" )

Real Data Analysis: Example data analysis problem (cont.) (exercise) 📖

Files: rda-NL.rds, rda-UK.csv, rda-PL.tsv, rda-US.xlsx, rda-genes.csv.

Continue with the task provided before.

Data manipulation 3/3

Map and user-defined functions (lecture) 👩‍🏫

1. User-defined Function

A user-defined function is a custom function created by the user to perform specific tasks that may not be available in R’s built-in functions. User-defined functions are reusable and are called by their given name. They lead to a more organised code and are easier to maintain.

# Add 1 to the vector x
add_one <- function(x) {
  # function body
  x + 1
}
# function call
add_one(2:3)
[1] 3 4

2. Anonymous Functions

  • What is an Anonymous Function?
    • An anonymous function (also called a “lambda” function) is a function without a name.
    • They are useful for small, one-time tasks.
  • Syntax in R:
    • Anonymous functions in R use function without assigning it to a variable.
    • Example: function(x) x + 1
  • When to Use Anonymous Functions:
    • For simple operations in single-use scenarios.
    • Inside other functions (e.g., mapping functions).
    • They make code shorter and more readable.

**Basic Syntax* for example, take the function definition of add_one without naming it and pass arguments to it in composition:

(function(x) x + 1)(2:3)
[1] 3 4

Anonymous functions are best for short and straightforward tasks where iteration is involved.

This is best shown in the next section with the purrr::map family of functions.

3. tidyverse purrr map and pmap functions

map(.x, .f, ...) is part of purrr, a sub-package of tidyverse. It applies function .f to each element of a list or vector .x and returns a list.

Example: doubling numbers in a vector with an anonymous function.

map(1:5, function(x) x * 2)
[[1]]
[1] 2

[[2]]
[1] 4

[[3]]
[1] 6

[[4]]
[1] 8

[[5]]
[1] 10

pmap is the ‘parallelised’ version of map supporting functions with more than 1 argument. It applies a function to elements across multiple lists or columns in a data frame. The notion parallel is used to signify element-wise processing.

Example: adding numbers from three vectors:

# pmap can handle lists and dataframes as input and returns a list.
#
# list
pmap(list(1:5,    # x
          6:10,   # y
          11:15), # z 
     function(x, y, z) x + y + z) 

# dataframe
pmap(tibble(a = 1:5, b = 6:10, c = 11:15), 
         function(a, b, c) a + b + c)

The map and pmap functions have specialised versions to return vectors of atomic types.They are suffixed with _int, _dbl, _lgl etc. for integer, double and logical respectively as returned types.

# Atomic vector as result
#
# map : return type double
map_dbl(1:5, function(x) x * 2)
# map : return type integer
map_int(1:5, function(x) x * 2)

# pmap
pmap_dbl(list(1:5,  # x
              6:10,  # y
              11:15), # z 
         function(x, y, z) x + y + z)

4. Anonymous functions, shorthand with ’' the ~ notation

map example:

map(1:5, ~ .x * 2)                         # <=> map(1:5, function(x) x * 2)
map(1:5, \(made_up_name) made_up_name * 2) # '\' shorthand for 'function'

Here, .x represents the argument to the anonymous function.

pmap example:

pmap_dbl(list(1:5, 6:10, 11:15), ~ .x)                  # 1st input 
pmap_dbl(list(1:5, 6:10, 11:15), ~ .x + .y)             # parallel sum of 1st and 2nd element 
pmap_dbl(list(1:5, 6:10, 11:15), ~ .x + .y + .z)        # error: only .x and .y are defined.
pmap_dbl(list(1:5, 6:10, 11:15), ~ ..1 + ..2 + ..3)     # NOT RECOMMANDED
pmap_dbl(list(1:5, 6:10, 11:15), function(a,b,c) a+b+c) # anonymous function
pmap_dbl(list(1:5, 6:10, 11:15), \(a,b,c) a+b+c)        # '\' shorthand for 'function' 

Map (exercise) 📖

  1. Add a new column row_sum to the table d holding the sum of all values in each row.
d <- tibble(a = 1:3, b = 4:6, c = 7:9)
  1. map vs pmap

Take the vector v and tibble t:

v <- 1:4
t <- tibble(a = 1:4, b = 5:8)

Which of the following statements are true?

  1. map(v, ~ .x * 2) and pmap(v, ~ .x * 2) will give the same output.
  2. pmap(t, sum) will compute the sum of a and b for each row in t.
  3. map(t, sum) is equivalent to pmap(t, sum) when applied to a data frame.
  4. map() always returns a list, while pmap() returns a tibble by default.
  1. In msleep mammals sleep dataset (tidyverse), find out which variables have missing values?

  2. Consider a simple list of grades:

grades <- list(
  Bob = c( 6, 7, NA, 7, NA, 5, 5, 8 ),
  Alice = c( 8, 9, 7.5, 8, 9 ),
  Daisy = c( 8, NA, 6.5, 6.5, 7, 8, 7 ),
  Charlie = c( 9, 7, 9 ),
  Eve = c( 5.5 )
)
  1. Calculate the mean grade of each student.
  2. Calculate the mean grade but now omit the missing. The output is a list.
  3. Repeat (b) with the output as a named vector.
  4. Calculate the number of grades of each student (including the missing ones). This should be a named numerical vector.
  5. Produce a numerical vector with counts of missing values for each student.
  6. Produce a logical telling whether a name has any missing grades.
  7. Reorder the list of grades so, that the names are in the alphabetical order, and the grades are sorted from the largest to the smallest. Don’t drop NA values (just keep them at the end).
Solution…
# 1)
d |> mutate(row_sum = pmap_dbl(d, function(a, b, c) a + b + c))

# 2) b

# 3)

map_lgl(msleep, function(x) any(is.na(x))) # alternatively use anyNA (see ?NA)

# 4)
grades <- list(
  Bob = c( 6, 7, NA, 7, NA, 5, 5, 8 ),
  Alice = c( 8, 9, 7.5, 8, 9 ),
  Daisy = c( 8, NA, 6.5, 6.5, 7, 8, 7 ),
  Charlie = c( 9, 7, 9 ),
  Eve = c( 5.5 )
)

# 4.a)
map(grades, mean) 
# 4.b)
map(grades, mean, na.rm=TRUE) 
# 4.c)
map_dbl(grades, mean, na.rm=TRUE) 
# 4.d)
map_dbl(grades, length) 
# 4.e)
map_dbl(map(grades,is.na), sum) 
# 4.f)
map_lgl(map(grades,is.na), any) 
# 4.g)
map(grades[sort(names(grades))], sort, decreasing=TRUE, na.last=TRUE )

Iterating over list elements with lapply and sapply (exercise) 📖

Needs: list, mean, length, sort.
Gives: lapply, sapply, any.
Practices: lapply, sapply, any.

Base-R provides mechanisms to apply a function to each element of a list (or another iterable object, e.g. a vector) and automatically build a list with results. If the iterable object has names, they are copied to the output list.

Example list

Let’s consider a simple list of grades:

grades <- list(
  Bob = c( 6, 7, NA, 7, NA, 5, 5, 8 ),
  Alice = c( 8, 9, 7.5, 8, 9 ),
  Daisy = c( 8, NA, 6.5, 6.5, 7, 8, 7 ),
  Charlie = c( 9, 7, 9 ),
  Eve = c( 5.5 )
)

Calculations

  • Find out how to use the lapply function to calculate the mean of the grades of each student.
  • Find out how in lapply to pass additional arguments for the mean function, so you can calculate the mean omitting the missing values.
  • Replace lapply with sapply. Observe how the type of the result changed.
  • Calculate the number of grades of each student (including the missing ones). This should be a named numerical vector.
  • Produce a numerical vector with counts of missing values for each student. Hint: use apply functions twice (with is.na and sum).
  • Produce a logical telling whether a name has any missing grades. Hint: understand functions any or all.
  • Reorder the list of grades so, that the names are in the alphabetical order, and the grades are sorted from the largest to the smallest. Don’t drop NA values (just keep them at the end).
Solution…
lapply( grades, mean )
lapply( grades, mean, na.rm = TRUE ) # additional arguments to mean
sapply( grades, mean, na.rm = TRUE )
sapply( grades, length )
sapply( lapply( grades, is.na ), sum )
sapply( lapply( grades, is.na ), any )

n2g <- lapply( grades, sort, decreasing=TRUE, na.last=TRUE )
n2g[ sort( names( n2g ) ) ]

Merge tables (lecture) 👩‍🏫

*_join(...) is a family of functions for combining two tibbles on common variable(s) called key.

For example, take a group of people (n=7) for which we have collected age and height data:

# height 
df1 <- tibble(name = c("Isa","Jaylinn","Mila","Milas","Yara"),
           height = c(160, 172, 182, 157, 162)) 
df1
# A tibble: 5 × 2
  name    height
  <chr>    <dbl>
1 Isa        160
2 Jaylinn    172
3 Mila       182
4 Milas      157
5 Yara       162
# age
df2 <- tibble(name = c("Fiene","Jaylinn","Mila","Noah","Yara"),
           age = c(20,24,17,23,17)) 
df2
# A tibble: 5 × 2
  name      age
  <chr>   <dbl>
1 Fiene      20
2 Jaylinn    24
3 Mila       17
4 Noah       23
5 Yara       17

The information here is incomplete, for some we have only height and for some only age. Now we want to combine these tibbles into a single table with variables name, age and height. We can do this with a join function which is able to combine observations by matching on the common variable(s) between the two tibbles also known as keys. Matching is done by one or more variables, in this case the variable name.

Note that for a proper joining of data, the matched variable(s), also known as the key, must be unique in both tibbles.

inner_join Return all rows in df1 where there are matching values of name in df2 and all columns in both df1 and df2:

inner_join(df1,df2, by = "name")
# A tibble: 3 × 3
  name    height   age
  <chr>    <dbl> <dbl>
1 Jaylinn    172    24
2 Mila       182    17
3 Yara       162    17

left_join Return all rows from df1 and all columns in both df1 and df2, NA for missing values in df2:

left_join(df1,df2, by = "name")
# A tibble: 5 × 3
  name    height   age
  <chr>    <dbl> <dbl>
1 Isa        160    NA
2 Jaylinn    172    24
3 Mila       182    17
4 Milas      157    NA
5 Yara       162    17

full_join All rows and all columns in df1 and df2.

full_join(df1,df2, by = "name")
# A tibble: 7 × 3
  name    height   age
  <chr>    <dbl> <dbl>
1 Isa        160    NA
2 Jaylinn    172    24
3 Mila       182    17
4 Milas      157    NA
5 Yara       162    17
6 Fiene       NA    20
7 Noah        NA    23

Bind tables (lecture) 👩‍🏫

Often we have different data sets with i) the same set of variables or ii) same set of observations but different variables which we would like to combine:

  • bind_rows(…) function creates a new tibble from a set of tibbles with not necessarily common variables.
    • treats tibbles as a collection of unordered variables.
  • bind_cols(…) function creates a new tibble from a set of tibbles with common observations.
    • expects the same number of observations in each tibble
t1 <- tibble(a=1:3, b=letters[1:3])
t2 <- tibble(a=4:6, b=letters[4:6])
t3 <- tibble(a=6:8)
bind_rows(t1,t2)  # compatible
# A tibble: 6 × 2
      a b    
  <int> <chr>
1     1 a    
2     2 b    
3     3 c    
4     4 d    
5     5 e    
6     6 f    
bind_rows(t1,t3)  # incompatible
# A tibble: 6 × 2
      a b    
  <int> <chr>
1     1 a    
2     2 b    
3     3 c    
4     6 <NA> 
5     7 <NA> 
6     8 <NA> 

To combine variables column-wise use bind_cols:

t4 <- tibble(b=letters[6:8])
bind_cols(t3,t4)
# A tibble: 3 × 2
      a b    
  <int> <chr>
1     6 f    
2     7 g    
3     8 h    

Note that the user is responsible for the order of observations in each tibble.

Merge tables (exercise) 📖

Files: pulseNA.csv, survey.csv.
Needs: 🔴“LcJoin”, “survey.csv”, “pulseNA.csv”, 🔴“LcMutate”, 🔴tidyverse.

  1. Unique keys To consistently merge tables with the given key(s) it is important to make sure that the chosen key(s) contain no duplicates.
  • 1.1) Take a look at the function distinct (?distinct) and use it to check whether the variable name in the pulse dataset has no duplicates and each name can uniquely identify an observation(row) in the table.

  • 1.2) Do the same test as in 1.1 with the function duplicated (?duplicated)

  1. The variable name in the survey dataset has duplicates, can you find other variables that would make it distinct when combined with name?

  2. Print the rows with duplicate names in pulse and survey dataset.

Solution…
# 1
#
# 1.1
#
# There are less distinct names than total nr. of observations, therefore some 
# names must be duplicated. 
#
check_names <- list( nr_of_distinct_names = pulse |>  distinct(name) |>  nrow(),  
   nr_of_names  = pulse |> nrow())

# Test equality: if FALSE then there are duplicates otherwise not. 
check_names$nr_of_distinct_names == check_names$nr_of_names 

# 1.2
#
pulse |> summarise(nr_of_distinct_names = sum( ! duplicated(name)),  nr_of_names = n() )

# 2
(survey |> distinct(name,height) |> nrow() ) == (survey |> nrow())
(survey |> distinct(name,span1) |> nrow() ) == (survey |> nrow())
(survey |> distinct(name,age) |> nrow() ) == (survey |> nrow())

# 3
# pulse
pulse_dupl_names <- pulse |>  filter(duplicated(name)) |> pull(name)  
pulse |>  filter(name %in% pulse_dupl_names) |>  arrange(name)
# survey
survey_dupl_names <- survey |>  filter(duplicated(name)) |> pull(name)  
survey |>  filter(name %in% survey_dupl_names) |>  arrange(name)

Bind tables (exercise) 📖

  1. In the survey dataset what is the mean height of the first 15 and the 30 observations combined?
  2. Convert the list below to a tibble with variables name and height:
heights  <- list(Andre=182.88, Benjamin=175.26, Alyson=173, Edward=175)
  1. How many duplications of (name,height) do you find in the combined survey and pulse dataset?
Solution…
#
survey <- read_csv("survey.csv", show_col_types = FALSE)
pulse  <- read_csv("pulseNA.csv", show_col_types = FALSE)

# 1)
bind_rows(survey |>  head(15), survey |>  tail(30)) |> 
  summarise(mean_height=mean(height, na.rm=TRUE))

# 2) 
heights  <- list(Andre=182.88, Benjamin=175.26, Alyson=173, Edward=175)
lapply(names(heights), 
       function(name_) tibble(name=name_, height=heights[[name_]])) |>  
  bind_rows()

# 3) 
bind_rows(pulse |> select(name, height), survey |> select(name, height)) |>
    duplicated() |>  
    sum()

Date/Time (lecture) 👨‍🏫

lubridate is a sub-package in tidyverse for working with dates and times in R. Handling date-time data can be tricky due to different formats, time zones and lubridate simplifies it by providing functions for parsing, extracting, and manipulating dates and times.

1. Parsing Dates and Times

lubridate has the following functions to convert date-time strings into proper date or datetime objects in R:

  • ymd(): Parses dates in “Year-Month-Day” format.
  • mdy(): Parses dates in “Month-Day-Year” format.
  • dmy(): Parses dates in “Day-Month-Year” format.

Similarly, for date-times (dates with times included), lubridate offers:

  • ymd_hms(): For “Year-Month-Day Hour:Minute:Second”.
  • mdy_hm(): For “Month-Day-Year Hour:Minute”.

Example:

library(lubridate)
date1 <- dmy("18-11-2024")
date1
[1] "2024-11-18"
date1 |> class()
[1] "Date"
date2 <- dmy_hms("21-11-2024 11:15:00")
date2
[1] "2024-11-21 11:15:00 UTC"

2. Extracting Components of Dates and Times

Once you have a date or date-time object, you often need to extract components like the year, month, or day. lubridate provides straightforward functions for this:

Example:

# Extract components
year(date1)    
month(date1)   
day(date1)     
hour(date2)    
[1] 2024
[1] 11
[1] 18
[1] 11

3. Modifying Dates and Times

You can also modify parts of a date-time object directly. For instance, if you want to change the month or add days to a date, lubridate provides functions like update() and simple arithmetic.

  • update(): Modifies specific parts of a date-time object.
  • Arithmetic: Adding or subtracting days, months, etc.

Example:

# Changing the month to December
date1 <- update(date1, month = 12)

# Adding "0001-01-03 01:02:25" to a date1:
date1 + years(1) + months(1) + days(3) + hours(1) + minutes(2) + seconds(25)

Note: years, months, etc. will create periods of date/time for date calculation, not to be confused with year, month, etc. which are used for extraction.

4. Working with Intervals and Durations

For date-based calculations, lubridate provides intervals, durations, and periods to help calculate differences between dates or add specific time amounts.

  • interval(): Creates an interval between two dates.
  • duration(): Measures exact time in seconds (useful for precise calculations).
  • period(): Measures human-readable time units like days, months, or years.

Example:

# interval between two dates
intrvl1 <- interval( dmy_hms("18-11-2024 11:15:00"), dmy_hms("22-11-2024 11:15:00"))
intrvl1
[1] 2024-11-18 11:15:00 UTC--2024-11-22 11:15:00 UTC
intrvl1 |> class()
[1] "Interval"
attr(,"package")
[1] "lubridate"
# duration in days
as.duration(intrvl1) /ddays(1) 
[1] 4
# period
as.period(intrvl1)
[1] "4d 0H 0M 0S"

Summary

lubridate simplifies date-time manipulation provides:

  • Easy parsing functions to convert strings to dates.
  • Extraction tools for getting specific date-time components.
  • Modification options to adjust dates and times.
  • Intervals and durations for date calculations.

Date/Time (exercise) 📖

  1. Date/Time formats Using the function format one can format dates/times, for example given new_years_eve of type Date:
new_years_eve # December 31, 2024 at 6 PM
[1] "2024-12-31 18:00:00 UTC"
format(new_years_eve, "%Y-%m-%d")
[1] "2024-12-31"

where %Y, %m and %d stands for year, month and day respectively. See section under ?strptime for more formatting options.

Exercise format new_years_eve in the following formats:

[1] "Tuesday, 31 December 2024"
[1] "Tue, Dec 31"
[1] "2024-12-31 18:00:00"
[1] "2024-12-31 06:00 PM"
[1] "Day 366 of the year, Week 52"
  1. storms dataset is part of the tidyverse package. Create a new dataframe storms_dates where the date related variables are replaced with variable date of type Date.

  2. List top 3 storms with longest period in storms_dates. Note that the storms are monitored and reported often every 6 hours but also in other intervals.

library(tidyverse)
storms |> str()
tibble [19,537 × 13] (S3: tbl_df/tbl/data.frame)
 $ name                        : chr [1:19537] "Amy" "Amy" "Amy" "Amy" ...
 $ year                        : num [1:19537] 1975 1975 1975 1975 1975 ...
 $ month                       : num [1:19537] 6 6 6 6 6 6 6 6 6 6 ...
 $ day                         : int [1:19537] 27 27 27 27 28 28 28 28 29 29 ...
 $ hour                        : num [1:19537] 0 6 12 18 0 6 12 18 0 6 ...
 $ lat                         : num [1:19537] 27.5 28.5 29.5 30.5 31.5 32.4 33.3 34 34.4 34 ...
 $ long                        : num [1:19537] -79 -79 -79 -79 -78.8 -78.7 -78 -77 -75.8 -74.8 ...
 $ status                      : Factor w/ 9 levels "disturbance",..: 7 7 7 7 7 7 7 7 8 8 ...
 $ category                    : num [1:19537] NA NA NA NA NA NA NA NA NA NA ...
 $ wind                        : int [1:19537] 25 25 25 25 25 25 25 30 35 40 ...
 $ pressure                    : int [1:19537] 1013 1013 1013 1013 1012 1012 1011 1006 1004 1002 ...
 $ tropicalstorm_force_diameter: int [1:19537] NA NA NA NA NA NA NA NA NA NA ...
 $ hurricane_force_diameter    : int [1:19537] NA NA NA NA NA NA NA NA NA NA ...
Solution…
# 1) 
new_years_eve <- dmy_hms("31/12/2024 18:00:00")
format(new_years_eve, "%A, %d %B %Y")
format(new_years_eve, "%a, %b %d")
format(new_years_eve, "%Y-%m-%d %H:%M:%S")
format(new_years_eve, "%Y-%m-%d %I:%M %p")
format(new_years_eve, "Day %j of the year, Week %U")

# 2)
storms_dates <- storms |> 
  mutate(date=dmy_hms(
    paste(paste(day,month,year,sep="-"), 
          paste(hour,"00","00",sep=":")) 
    )
  ) |> 
  select(-year,-month,-day,-hour) |> 
  select(name,date,everything())

# 3) 
# Note: this solution does not catch storms crossing over years.     
storms_dates |> 
  mutate(year=year(date)) |> 
  group_by(name,year) |>
  mutate(period=as.period(as.interval(min(date),max(date)))) |> 
  distinct(name,year,period) |> 
  ungroup() |> 
  arrange(desc(period)) |> 
  head(3)

Self-Study Assignment

Real Data Analysis: Example data analysis problem (cont.) (exercise) 📖

Files: rda-NL.rds, rda-UK.csv, rda-PL.tsv, rda-US.xlsx, rda-genes.csv.

Continue with the task provided before.